Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/08/04 13:29:42 UTC

[GitHub] [beam] PhilippeMoussalli opened a new pull request, #22587: WIP: Dataframe API ML preprocessing notebook

PhilippeMoussalli opened a new pull request, #22587:
URL: https://github.com/apache/beam/pull/22587

   - PR that implements a notebook to demonstrate the usage of the beam dataframe API as a preprocessing tool for ML training
   
   WIP:
   - [ ] **Find a method to implement the one-hot-encoding for encoding categorical variables:** related to ticket [#22268](https://github.com/apache/beam/issues/22268)
   - [ ] **Fix the bug that returns `ValueError: No producer for ref_PCollection_PCollection_265` when attempting to merge two deferred datasets:** related to ticket [#22267](https://github.com/apache/beam/issues/22267)
   - [ ] Have only one installation script for Beam with the latest implemented functions in the Dataframe API instead of installing from source
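
As a rough illustration of what the first item is after, here is a plain-pandas sketch of one-hot encoding. It is not the notebook's Beam implementation: the sample rows and the `APO` category are made up for the example, and declaring the categories up front with `pd.CategoricalDtype` is an assumption about what a deferred DataFrame would need so that the output columns are known before any data is read.

```python
import pandas as pd

# Hypothetical sample data; only the column name comes from the notebook.
df = pd.DataFrame({"object_class": ["MBA", "APO", "MBA"]})

# Fix the set of categories ahead of time so the encoded columns are
# determined by the dtype rather than by the values that happen to appear.
class_dtype = pd.CategoricalDtype(categories=["APO", "MBA"])
df["object_class"] = df["object_class"].astype(class_dtype)

# One indicator column per declared category.
encoded = pd.get_dummies(df, columns=["object_class"])
print(encoded)
```

Each declared category becomes its own indicator column, named `object_class_<category>`, even if a category never occurs in the data.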
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] TheNeuralBit commented on pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on PR #22587:
URL: https://github.com/apache/beam/pull/22587#issuecomment-1205503741

   I think the "No producer" error is a bug with PCollection pruning in interactive beam: #21430
   
   CC: @KevinGG 




[GitHub] [beam] damccorm merged pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
damccorm merged PR #22587:
URL: https://github.com/apache/beam/pull/22587




[GitHub] [beam] TheNeuralBit commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r972137927


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. They provide a familiar interface that lets machine learning practitioners build complex data-processing pipelines by invoking only standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to quickly test our pipeline locally before running it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we will only be working with a subset of the original dataset since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam function returns a deferred Beam DataFrame, whereas the pandas function returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, sample] where\n",
+        "# sample contains all of the dataset (around 1000000 samples)\n",
+        "file_location = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize the pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(file_location, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are an innumerable number of objects in outer space, and some of them are closer than we think. A distance of 70,000 km may not seem like it could harm us, but on an astronomical scale it is very small and can disrupt many natural phenomena. \n",
+        "\n",
+        "These objects/asteroids can thus prove to be harmful. Hence, it is wise to know what surrounds us and which of those objects could harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest-Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types."
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "c24ff83d-3a13-47a6-c9c2-3978729fde82"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 4
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a Pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 746
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "14fa80de-2dee-4963-99d8-3e321f949ff8"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a986c6cc61ed5a5b622e163d92f73775\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "       spk_id                   full_name near_earth_object  \\\n",
+              "0     2000001                     1 Ceres                 N   \n",
+              "1     2000002                    2 Pallas                 N   \n",
+              "2     2000003                      3 Juno                 N   \n",
+              "3     2000004                     4 Vesta                 N   \n",
+              "4     2000005                   5 Astraea                 N   \n",
+              "...       ...                         ...               ...   \n",
+              "9994  2009995    9995 Alouette (4805 P-L)                 N   \n",
+              "9995  2009996         9996 ANS (9070 P-L)                 N   \n",
+              "9996  2009997        9997 COBE (1217 T-1)                 N   \n",
+              "9997  2009998         9998 ISO (1293 T-1)                 N   \n",
+              "9998  2009999       9999 Wiles (4196 T-2)                 N   \n",
+              "\n",
+              "      absolute_magnitude  diameter  albedo  diameter_sigma  eccentricity  \\\n",
+              "0                   3.40   939.400  0.0900           0.200      0.076009   \n",
+              "1                   4.20   545.000  0.1010          18.000      0.229972   \n",
+              "2                   5.33   246.596  0.2140          10.594      0.256936   \n",
+              "3                   3.00   525.400  0.4228           0.200      0.088721   \n",
+              "4                   6.90   106.699  0.2740           3.140      0.190913   \n",
+              "...                  ...       ...     ...             ...           ...   \n",
+              "9994               15.10     2.564  0.2450           0.550      0.160610   \n",
+              "9995               13.60     8.978  0.1130           0.376      0.235174   \n",
+              "9996               14.30       NaN     NaN             NaN      0.113059   \n",
+              "9997               15.10     2.235  0.3880           0.373      0.093852   \n",
+              "9998               13.00     7.148  0.2620           0.065      0.071351   \n",
+              "\n",
+              "      inclination     moid_ld object_class  semi_major_axis_au_unit  \\\n",
+              "0       10.594067  620.640533          MBA                 2.769165   \n",
+              "1       34.832932  480.348639          MBA                 2.773841   \n",
+              "2       12.991043  402.514639          MBA                 2.668285   \n",
+              "3        7.141771  443.451432          MBA                 2.361418   \n",
+              "4        5.367427  426.433027          MBA                 2.574037   \n",
+              "...           ...         ...          ...                      ...   \n",
+              "9994     2.311731  388.723233          MBA                 2.390249   \n",
+              "9995     7.657713  444.194746          MBA                 2.796605   \n",
+              "9996     2.459643  495.460110          MBA                 2.545674   \n",
+              "9997     3.912263  373.848377          MBA                 2.160961   \n",
+              "9998     3.198839  632.144398          MBA                 2.839917   \n",
+              "\n",
+              "     hazardous_flag  \n",
+              "0                 N  \n",
+              "1                 N  \n",
+              "2                 N  \n",
+              "3                 N  \n",
+              "4                 N  \n",
+              "...             ...  \n",
+              "9994              N  \n",
+              "9995              N  \n",
+              "9996              N  \n",
+              "9997              N  \n",
+              "9998              N  \n",
+              "\n",
+              "[9999 rows x 13 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-0161aa69-d50f-4d6f-84c1-10dacb278880\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>full_name</th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>2000001</td>\n",
+              "      <td>1 Ceres</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>939.400</td>\n",
+              "      <td>0.0900</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>2000002</td>\n",
+              "      <td>2 Pallas</td>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>545.000</td>\n",
+              "      <td>0.1010</td>\n",
+              "      <td>18.000</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>2000003</td>\n",
+              "      <td>3 Juno</td>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>246.596</td>\n",
+              "      <td>0.2140</td>\n",
+              "      <td>10.594</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>2000004</td>\n",
+              "      <td>4 Vesta</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>525.400</td>\n",
+              "      <td>0.4228</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>2000005</td>\n",
+              "      <td>5 Astraea</td>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>106.699</td>\n",
+              "      <td>0.2740</td>\n",
+              "      <td>3.140</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>2009995</td>\n",
+              "      <td>9995 Alouette (4805 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.564</td>\n",
+              "      <td>0.2450</td>\n",
+              "      <td>0.550</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>2009996</td>\n",
+              "      <td>9996 ANS (9070 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>8.978</td>\n",
+              "      <td>0.1130</td>\n",
+              "      <td>0.376</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>2009997</td>\n",
+              "      <td>9997 COBE (1217 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>2009998</td>\n",
+              "      <td>9998 ISO (1293 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.235</td>\n",
+              "      <td>0.3880</td>\n",
+              "      <td>0.373</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>2009999</td>\n",
+              "      <td>9999 Wiles (4196 T-2)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>7.148</td>\n",
+              "      <td>0.2620</td>\n",
+              "      <td>0.065</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 13 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-0161aa69-d50f-4d6f-84c1-10dacb278880')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 5
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check whether all the columns can be used for model training. Let's first have a look at the column descriptions provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the apparent magnitude an object would have if it were located at a distance of 10 parsecs.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km\n",
+        "* **albedo:** a measure of the diffuse reflection of solar radiation out of the total solar radiation and measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in object diameter, in km\n",
+        "* **eccentricity:** value between 0 and 1 that refers to how circular or elongated the asteroid's orbit is\n",
+        "* **inclination:** angle with respect to x-y ecliptic plane\n",
+        "* **moid_ld:** Earth Minimum Orbit Intersection Distance, in lunar distances (LD)\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** the length of half of the long axis of the orbit, in au\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. They identify an asteroid rather than describe it, so we can remove them since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's have a look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_02cba1067de8e024374a26297834d233\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most columns have no missing values. Columns **'diameter'**, **'albedo'** and **'diameter_sigma'** are each missing in roughly 13-14% of the rows. Since these values cannot be measured or derived, we can remove these columns; they will not be required for machine learning model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "The numerical columns need to be normalized before using them to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation. This puts all the columns on the same scale so that they are weighted equally during training."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical columns and the categorical columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 356
+        },
+        "id": "Gjc0UlDD-xUn",
+        "outputId": "cadc4402-7edc-43f6-bc6a-bc4f1fd3314a"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "error",
+          "ename": "NotImplementedError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mNotImplementedError\u001b[0m                       Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/987840581.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Normalizing method_1: Can work but relies on ticket\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frame_base.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    426\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    427\u001b[0m     raise NotImplementedError(\n\u001b[0;32m--> 428\u001b[0;31m         \u001b[0;34mf\"{op!r} is not implemented yet. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    429\u001b[0m         \u001b[0;34mf\"If support for {op!r} is important to you, please let the Beam \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    430\u001b[0m         \u001b[0;34m\"community know by writing to user@beam.apache.org \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mNotImplementedError\u001b[0m: 'loc.setitem' is not implemented yet. If support for 'loc.setitem' is important to you, please let the Beam community know by writing to user@beam.apache.org (see https://beam.apache.org/community/contact-us/) or commenting on https://github.com/apache/beam/issues/20318"
+          ]
+        }
+      ],
+      "source": [
+        "# Normalizing method_1: Can work but relies on ticket #22267\n",
+        "beam_df.loc[:, numerical_cols] = (beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean()) / beam_df.loc[:, numerical_cols].std()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
+        "Normalizing the data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "be40308f-d27c-46fa-e365-43a635addf6b"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>306</th>\n",
+              "      <td>-1.570727</td>\n",
+              "      <td>-0.062543</td>\n",
+              "      <td>-0.278518</td>\n",
+              "      <td>0.373194</td>\n",
+              "      <td>0.357201</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>310</th>\n",
+              "      <td>-1.631718</td>\n",
+              "      <td>-1.724526</td>\n",
+              "      <td>-0.736389</td>\n",
+              "      <td>1.087833</td>\n",
+              "      <td>0.344233</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>546</th>\n",
+              "      <td>-1.753698</td>\n",
+              "      <td>1.028793</td>\n",
+              "      <td>1.415303</td>\n",
+              "      <td>-0.339489</td>\n",
+              "      <td>0.139080</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>635</th>\n",
+              "      <td>-1.875678</td>\n",
+              "      <td>0.244869</td>\n",
+              "      <td>0.005905</td>\n",
+              "      <td>0.214107</td>\n",
+              "      <td>0.367559</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>701</th>\n",
+              "      <td>-3.278451</td>\n",
+              "      <td>-1.570523</td>\n",
+              "      <td>2.006145</td>\n",
+              "      <td>1.542754</td>\n",
+              "      <td>0.829337</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9697</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-1.151809</td>\n",
+              "      <td>-0.082944</td>\n",
+              "      <td>-0.129556</td>\n",
+              "      <td>-0.533538</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9813</th>\n",
+              "      <td>1.722740</td>\n",
+              "      <td>0.844551</td>\n",
+              "      <td>-0.583247</td>\n",
+              "      <td>-1.006447</td>\n",
+              "      <td>-0.677961</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9868</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-0.207399</td>\n",
+              "      <td>-0.784665</td>\n",
+              "      <td>-0.462136</td>\n",
+              "      <td>-0.539794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9903</th>\n",
+              "      <td>0.868878</td>\n",
+              "      <td>0.460086</td>\n",
+              "      <td>0.092258</td>\n",
+              "      <td>-0.107597</td>\n",
+              "      <td>0.071794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9956</th>\n",
+              "      <td>0.746898</td>\n",
+              "      <td>-0.234132</td>\n",
+              "      <td>-0.161116</td>\n",
+              "      <td>-0.601379</td>\n",
+              "      <td>-0.664887</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 5 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_numerical_cols = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize the DataFrame that contains only the numerical columns\n",
+        "beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean())/beam_numerical_cols.std()\n",
+        "\n",
+        "ib.collect(beam_numerical_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to convert the categorical columns into one-hot encoded variables so that we can use them during training.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "Ngoxg0rSywVd",
+        "outputId": "d81bb29a-f8f8-4186-b186-5ff85667dbec"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1671644751.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'object_class'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'get_dummies'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col= beam_df.filter(items=['object_class'])\n",
+        "object_class_col.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "hz8s7z8caTq-",
+        "outputId": "5c543a6b-0ea1-41f8-afec-9691bbbd1f5b"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1927971370.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;31m# df['categories_concat'].str.get_dummies('-')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'str'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col.str.get_dummies()"

Review Comment:
   Hm, that's odd. This could be a bug. @yeandy do we have a way to apply a categorical dtype to a column?
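
For context on why a categorical dtype matters here: a deferred frame has to know its output columns before any data is read, and one-hot encoding only has a fixed set of columns when the categories are declared up front. Below is a minimal pure-Python sketch of that idea. The helper `one_hot` and the category labels are hypothetical placeholders, not Beam API and not values confirmed from the dataset. In plain pandas the equivalent would be `Series.astype(pd.CategoricalDtype(categories=[...]))` before calling `pd.get_dummies`; whether the deferred API accepts the same call is the open question in this thread.

```python
# Hypothetical sketch: one-hot encoding with categories declared up front.
# The labels below are placeholders, not values confirmed from the dataset.
CATEGORIES = ["MBA", "AMO", "APO"]


def one_hot(value, categories=CATEGORIES):
    """Return a dict with one 0/1 entry per declared category.

    The set of output keys depends only on `categories`, never on the data,
    which is the property a deferred DataFrame needs to build its schema.
    """
    return {f"object_class_{c}": int(value == c) for c in categories}


# "XYZ" is not a declared category, so it encodes to all zeros.
rows = ["AMO", "MBA", "XYZ"]
encoded = [one_hot(v) for v in rows]
```

Because the keys come from the declared list rather than from the rows, every encoded record has the same shape regardless of which values actually appear in the data.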





[GitHub] [beam] PhilippeMoussalli commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
PhilippeMoussalli commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r1005707902


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,2163 @@
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "source": [
+        "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License."
+      ],
+      "metadata": {
+        "id": "sARMhsXz8yR1",
+        "cellView": "form"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like\n",
+        "API to declare and define Beam processing pipelines. This API gives machine learning practitioners a familiar interface for building complex data-processing pipelines using only standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "Because we want to explore the elements within a `PCollection`, we install Apache Beam with the `interactive` component so that we can use the Interactive runner. The latest DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to test our code interactively, building out the pipeline as we go before deploying it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we work with only a subset of the original dataset, since we're using only the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function, which emulates `pandas.read_csv`. The main difference between them is that the Beam function returns a deferred Beam DataFrame, while pandas returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 5,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam import dataframe\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, full], where\n",
+        "# full contains the entire dataset (around 1000000 samples)\n",
+        "\n",
+        "source_csv_file = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(source_csv_file, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. Although a distance of 70,000 km might not seem potentially harmful, at an astronomical scale it is very small and can disrupt many natural phenomena. \n",
+        "\n",
+        "These objects and asteroids can therefore prove harmful, so it is wise to know what surrounds us and which of those objects can harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types."
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 3,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "f5386993-14cb-42ee-94ca-8ea006860d3e"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 3
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a Pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 8,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 378
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "e49b4243-107f-4256-9e09-49cc20bf7f56"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_f6af66571c53daa0d9052370b7d1d8b7\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_f6af66571c53daa0d9052370b7d1d8b7\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_f6af66571c53daa0d9052370b7d1d8b7\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "             spk_id  absolute_magnitude     diameter       albedo  \\\n",
+              "count  9.999000e+03         9999.000000  8688.000000  8672.000000   \n",
+              "mean   2.005000e+06           12.675380    19.245446     0.197723   \n",
+              "std    2.886607e+03            1.639609    30.190191     0.138819   \n",
+              "min    2.000001e+06            3.000000     0.300000     0.008000   \n",
+              "25%    2.002500e+06           11.900000     5.614000     0.074000   \n",
+              "50%    2.005000e+06           12.900000     9.814000     0.187000   \n",
+              "75%    2.007500e+06           13.700000    19.156750     0.283000   \n",
+              "max    2.009999e+06           20.700000   939.400000     1.000000   \n",
+              "\n",
+              "       diameter_sigma  eccentricity  inclination      moid_ld  \\\n",
+              "count     8591.000000   9999.000000  9999.000000  9999.000000   \n",
+              "mean         0.454072      0.148716     7.890742   509.805237   \n",
+              "std          1.093676      0.083803     6.336244   205.046582   \n",
+              "min          0.006000      0.001003     0.042716     0.131028   \n",
+              "25%          0.120000      0.093780     3.220137   377.829197   \n",
+              "50%          0.201000      0.140335     6.018836   470.650523   \n",
+              "75%          0.375000      0.187092    10.918176   636.010802   \n",
+              "max         39.297000      0.889831    68.018875  4241.524913   \n",
+              "\n",
+              "       semi_major_axis_au_unit  \n",
+              "count              9999.000000  \n",
+              "mean                  2.689836  \n",
+              "std                   0.607190  \n",
+              "min                   0.832048  \n",
+              "25%                   2.340816  \n",
+              "50%                   2.614468  \n",
+              "75%                   3.005449  \n",
+              "max                  24.667968  "
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-2c0349a9-81c4-473a-9fa1-44c423244858\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>count</th>\n",
+              "      <td>9.999000e+03</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>8688.000000</td>\n",
+              "      <td>8672.000000</td>\n",
+              "      <td>8591.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>mean</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.675380</td>\n",
+              "      <td>19.245446</td>\n",
+              "      <td>0.197723</td>\n",
+              "      <td>0.454072</td>\n",
+              "      <td>0.148716</td>\n",
+              "      <td>7.890742</td>\n",
+              "      <td>509.805237</td>\n",
+              "      <td>2.689836</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>std</th>\n",
+              "      <td>2.886607e+03</td>\n",
+              "      <td>1.639609</td>\n",
+              "      <td>30.190191</td>\n",
+              "      <td>0.138819</td>\n",
+              "      <td>1.093676</td>\n",
+              "      <td>0.083803</td>\n",
+              "      <td>6.336244</td>\n",
+              "      <td>205.046582</td>\n",
+              "      <td>0.607190</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>min</th>\n",
+              "      <td>2.000001e+06</td>\n",
+              "      <td>3.000000</td>\n",
+              "      <td>0.300000</td>\n",
+              "      <td>0.008000</td>\n",
+              "      <td>0.006000</td>\n",
+              "      <td>0.001003</td>\n",
+              "      <td>0.042716</td>\n",
+              "      <td>0.131028</td>\n",
+              "      <td>0.832048</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>25%</th>\n",
+              "      <td>2.002500e+06</td>\n",
+              "      <td>11.900000</td>\n",
+              "      <td>5.614000</td>\n",
+              "      <td>0.074000</td>\n",
+              "      <td>0.120000</td>\n",
+              "      <td>0.093780</td>\n",
+              "      <td>3.220137</td>\n",
+              "      <td>377.829197</td>\n",
+              "      <td>2.340816</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>50%</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.900000</td>\n",
+              "      <td>9.814000</td>\n",
+              "      <td>0.187000</td>\n",
+              "      <td>0.201000</td>\n",
+              "      <td>0.140335</td>\n",
+              "      <td>6.018836</td>\n",
+              "      <td>470.650523</td>\n",
+              "      <td>2.614468</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>75%</th>\n",
+              "      <td>2.007500e+06</td>\n",
+              "      <td>13.700000</td>\n",
+              "      <td>19.156750</td>\n",
+              "      <td>0.283000</td>\n",
+              "      <td>0.375000</td>\n",
+              "      <td>0.187092</td>\n",
+              "      <td>10.918176</td>\n",
+              "      <td>636.010802</td>\n",
+              "      <td>3.005449</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>max</th>\n",
+              "      <td>2.009999e+06</td>\n",
+              "      <td>20.700000</td>\n",
+              "      <td>939.400000</td>\n",
+              "      <td>1.000000</td>\n",
+              "      <td>39.297000</td>\n",
+              "      <td>0.889831</td>\n",
+              "      <td>68.018875</td>\n",
+              "      <td>4241.524913</td>\n",
+              "      <td>24.667968</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-2c0349a9-81c4-473a-9fa1-44c423244858')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-2c0349a9-81c4-473a-9fa1-44c423244858 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-2c0349a9-81c4-473a-9fa1-44c423244858');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://developers.google.com/machine-learning/data-prep/transform/normalization) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://developers.google.com/machine-learning/data-prep/transform/transform-categorical) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
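+    {
+      "cell_type": "markdown",
+      "source": [
+        "As a rough illustration of these two transformations, the sketch below applies them to a tiny in-memory pandas DataFrame. It is not part of the Beam pipeline in this notebook: the rows are made up, `toy_df` is a hypothetical name, and z-score scaling is only one of several normalization techniques.\n",
+        "\n",
+        "```python\n",
+        "import pandas as pd\n",
+        "\n",
+        "# Hypothetical toy rows: column names follow the dataset, values are made up.\n",
+        "toy_df = pd.DataFrame({\n",
+        "    'diameter': [5.0, 10.0, 15.0],\n",
+        "    'object_class': ['MBA', 'OMB', 'MBA'],\n",
+        "})\n",
+        "\n",
+        "# Normalization (z-score): subtract the column mean, divide by the standard deviation.\n",
+        "numerical = toy_df[['diameter']]\n",
+        "normalized = (numerical - numerical.mean()) / numerical.std()\n",
+        "\n",
+        "# One-hot encoding: one indicator column per distinct category value.\n",
+        "one_hot = pd.get_dummies(toy_df['object_class'], prefix='object_class')\n",
+        "\n",
+        "# Combine both results into a single table of model-ready features.\n",
+        "preprocessed = pd.concat([normalized, one_hot], axis=1)\n",
+        "```\n",
+        "\n",
+        "The result has one scaled numerical column and one indicator column per category value."
+      ],
+      "metadata": {}
+    },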
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can also use the standard pandas command `DataFrame.describe()` to generate descriptive statistics for the numerical columns, such as percentiles, mean, and standard deviation."
+      ],
+      "metadata": {
+        "id": "MGAErO0lAYws"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
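+        "# describe() is a non-parallel operation (it has to collect the dataset on a single worker),\n",
+        "# so Beam requires opting in explicitly.\n",
+        "with dataframe.allow_non_parallel_operations():\n",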
+        "  beam_df_description = ib.collect(beam_df.describe())\n",
+        "\n",
+        "beam_df_description"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 378
+        },
+        "id": "Befv697VBGM7",
+        "outputId": "d02b7a41-a8a3-4837-cf63-e1fa9e7b011e"
+      },
+      "execution_count": 9,
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a5b31481d153dff1b7ecdd673624949b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a5b31481d153dff1b7ecdd673624949b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a5b31481d153dff1b7ecdd673624949b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "             spk_id  absolute_magnitude     diameter       albedo  \\\n",
+              "count  9.999000e+03         9999.000000  8688.000000  8672.000000   \n",
+              "mean   2.005000e+06           12.675380    19.245446     0.197723   \n",
+              "std    2.886607e+03            1.639609    30.190191     0.138819   \n",
+              "min    2.000001e+06            3.000000     0.300000     0.008000   \n",
+              "25%    2.002500e+06           11.900000     5.614000     0.074000   \n",
+              "50%    2.005000e+06           12.900000     9.814000     0.187000   \n",
+              "75%    2.007500e+06           13.700000    19.156750     0.283000   \n",
+              "max    2.009999e+06           20.700000   939.400000     1.000000   \n",
+              "\n",
+              "       diameter_sigma  eccentricity  inclination      moid_ld  \\\n",
+              "count     8591.000000   9999.000000  9999.000000  9999.000000   \n",
+              "mean         0.454072      0.148716     7.890742   509.805237   \n",
+              "std          1.093676      0.083803     6.336244   205.046582   \n",
+              "min          0.006000      0.001003     0.042716     0.131028   \n",
+              "25%          0.120000      0.093780     3.220137   377.829197   \n",
+              "50%          0.201000      0.140335     6.018836   470.650523   \n",
+              "75%          0.375000      0.187092    10.918176   636.010802   \n",
+              "max         39.297000      0.889831    68.018875  4241.524913   \n",
+              "\n",
+              "       semi_major_axis_au_unit  \n",
+              "count              9999.000000  \n",
+              "mean                  2.689836  \n",
+              "std                   0.607190  \n",
+              "min                   0.832048  \n",
+              "25%                   2.340816  \n",
+              "50%                   2.614468  \n",
+              "75%                   3.005449  \n",
+              "max                  24.667968  "
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-d16cf806-a3e2-46d9-973d-74448570aaa2\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>count</th>\n",
+              "      <td>9.999000e+03</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>8688.000000</td>\n",
+              "      <td>8672.000000</td>\n",
+              "      <td>8591.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>mean</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.675380</td>\n",
+              "      <td>19.245446</td>\n",
+              "      <td>0.197723</td>\n",
+              "      <td>0.454072</td>\n",
+              "      <td>0.148716</td>\n",
+              "      <td>7.890742</td>\n",
+              "      <td>509.805237</td>\n",
+              "      <td>2.689836</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>std</th>\n",
+              "      <td>2.886607e+03</td>\n",
+              "      <td>1.639609</td>\n",
+              "      <td>30.190191</td>\n",
+              "      <td>0.138819</td>\n",
+              "      <td>1.093676</td>\n",
+              "      <td>0.083803</td>\n",
+              "      <td>6.336244</td>\n",
+              "      <td>205.046582</td>\n",
+              "      <td>0.607190</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>min</th>\n",
+              "      <td>2.000001e+06</td>\n",
+              "      <td>3.000000</td>\n",
+              "      <td>0.300000</td>\n",
+              "      <td>0.008000</td>\n",
+              "      <td>0.006000</td>\n",
+              "      <td>0.001003</td>\n",
+              "      <td>0.042716</td>\n",
+              "      <td>0.131028</td>\n",
+              "      <td>0.832048</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>25%</th>\n",
+              "      <td>2.002500e+06</td>\n",
+              "      <td>11.900000</td>\n",
+              "      <td>5.614000</td>\n",
+              "      <td>0.074000</td>\n",
+              "      <td>0.120000</td>\n",
+              "      <td>0.093780</td>\n",
+              "      <td>3.220137</td>\n",
+              "      <td>377.829197</td>\n",
+              "      <td>2.340816</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>50%</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.900000</td>\n",
+              "      <td>9.814000</td>\n",
+              "      <td>0.187000</td>\n",
+              "      <td>0.201000</td>\n",
+              "      <td>0.140335</td>\n",
+              "      <td>6.018836</td>\n",
+              "      <td>470.650523</td>\n",
+              "      <td>2.614468</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>75%</th>\n",
+              "      <td>2.007500e+06</td>\n",
+              "      <td>13.700000</td>\n",
+              "      <td>19.156750</td>\n",
+              "      <td>0.283000</td>\n",
+              "      <td>0.375000</td>\n",
+              "      <td>0.187092</td>\n",
+              "      <td>10.918176</td>\n",
+              "      <td>636.010802</td>\n",
+              "      <td>3.005449</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>max</th>\n",
+              "      <td>2.009999e+06</td>\n",
+              "      <td>20.700000</td>\n",
+              "      <td>939.400000</td>\n",
+              "      <td>1.000000</td>\n",
+              "      <td>39.297000</td>\n",
+              "      <td>0.889831</td>\n",
+              "      <td>68.018875</td>\n",
+              "      <td>4241.524913</td>\n",
+              "      <td>24.667968</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-d16cf806-a3e2-46d9-973d-74448570aaa2')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-d16cf806-a3e2-46d9-973d-74448570aaa2 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-d16cf806-a3e2-46d9-973d-74448570aaa2');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 9
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before applying any transformations, we need to check whether all the columns are needed for model training. Let's first look at the column descriptions provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** Absolute magnitude parameter (H): the magnitude the asteroid would have at 1 au from both the Sun and the observer, at zero phase angle\n",
+        "* **diameter:** Object diameter (from equivalent sphere), in km\n",
+        "* **albedo:** Fraction of the incoming solar radiation that the object diffusely reflects, measured on a scale from 0 to 1\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in object diameter, in km\n",
+        "* **eccentricity:** Value between 0 and 1 that describes how much the asteroid's orbit deviates from a perfect circle (0 is circular)\n",
+        "* **inclination:** Angle of the orbit with respect to the x-y ecliptic plane\n",
+        "* **moid_ld:** Earth Minimum Orbit Intersection Distance, in lunar distances (LD)\n",
+        "* **object_class:** Classification of the asteroid. See this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** Length of half of the long axis of the orbit, in au\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "The **'spk_id'** and **'full_name'** columns are unique for each row. They can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's have a look at the number of missing values."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_02cba1067de8e024374a26297834d233\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most of the columns do not have missing values. However, the columns **'diameter'**, **'albedo'** and **'diameter_sigma'** are each missing in roughly 13-14% of the rows. Since these values cannot be measured or derived, we remove these columns; they will not be required for training the machine learning model."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "Next, we need to normalize the numerical columns before using them to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation (a.k.a. [z-score](https://developers.google.com/machine-learning/data-prep/transform/normalization#z-score)). This improves the performance and training stability of the model during training and inference.\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical columns and the categorical columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 356
+        },
+        "id": "Gjc0UlDD-xUn",
+        "outputId": "cadc4402-7edc-43f6-bc6a-bc4f1fd3314a"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "error",
+          "ename": "NotImplementedError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mNotImplementedError\u001b[0m                       Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/987840581.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Normalizing method_1: Can work but relies on ticket\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frame_base.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    426\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    427\u001b[0m     raise NotImplementedError(\n\u001b[0;32m--> 428\u001b[0;31m         \u001b[0;34mf\"{op!r} is not implemented yet. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    429\u001b[0m         \u001b[0;34mf\"If support for {op!r} is important to you, please let the Beam \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    430\u001b[0m         \u001b[0;34m\"community know by writing to user@beam.apache.org \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mNotImplementedError\u001b[0m: 'loc.setitem' is not implemented yet. If support for 'loc.setitem' is important to you, please let the Beam community know by writing to user@beam.apache.org (see https://beam.apache.org/community/contact-us/) or commenting on https://github.com/apache/beam/issues/20318"
+          ]
+        }
+      ],
+      "source": [
+        "# Normalizing method_1: Can work but relies on ticket #22267\n",
+        "beam_df.loc[:, numerical_cols] = (beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean()) / beam_df.loc[:, numerical_cols].std()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
+        "Normalizing the data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "be40308f-d27c-46fa-e365-43a635addf6b"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>306</th>\n",
+              "      <td>-1.570727</td>\n",
+              "      <td>-0.062543</td>\n",
+              "      <td>-0.278518</td>\n",
+              "      <td>0.373194</td>\n",
+              "      <td>0.357201</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>310</th>\n",
+              "      <td>-1.631718</td>\n",
+              "      <td>-1.724526</td>\n",
+              "      <td>-0.736389</td>\n",
+              "      <td>1.087833</td>\n",
+              "      <td>0.344233</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>546</th>\n",
+              "      <td>-1.753698</td>\n",
+              "      <td>1.028793</td>\n",
+              "      <td>1.415303</td>\n",
+              "      <td>-0.339489</td>\n",
+              "      <td>0.139080</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>635</th>\n",
+              "      <td>-1.875678</td>\n",
+              "      <td>0.244869</td>\n",
+              "      <td>0.005905</td>\n",
+              "      <td>0.214107</td>\n",
+              "      <td>0.367559</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>701</th>\n",
+              "      <td>-3.278451</td>\n",
+              "      <td>-1.570523</td>\n",
+              "      <td>2.006145</td>\n",
+              "      <td>1.542754</td>\n",
+              "      <td>0.829337</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9697</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-1.151809</td>\n",
+              "      <td>-0.082944</td>\n",
+              "      <td>-0.129556</td>\n",
+              "      <td>-0.533538</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9813</th>\n",
+              "      <td>1.722740</td>\n",
+              "      <td>0.844551</td>\n",
+              "      <td>-0.583247</td>\n",
+              "      <td>-1.006447</td>\n",
+              "      <td>-0.677961</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9868</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-0.207399</td>\n",
+              "      <td>-0.784665</td>\n",
+              "      <td>-0.462136</td>\n",
+              "      <td>-0.539794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9903</th>\n",
+              "      <td>0.868878</td>\n",
+              "      <td>0.460086</td>\n",
+              "      <td>0.092258</td>\n",
+              "      <td>-0.107597</td>\n",
+              "      <td>0.071794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9956</th>\n",
+              "      <td>0.746898</td>\n",
+              "      <td>-0.234132</td>\n",
+              "      <td>-0.161116</td>\n",
+              "      <td>-0.601379</td>\n",
+              "      <td>-0.664887</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 5 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_numerical_cols = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize only the numerical columns\n",
+        "beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean())/beam_numerical_cols.std()\n",
+        "\n",
+        "ib.collect(beam_numerical_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to convert the categorical columns into one-hot encoded variables so that they can be used during training.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "Ngoxg0rSywVd",
+        "outputId": "d81bb29a-f8f8-4186-b186-5ff85667dbec"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1671644751.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'object_class'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'get_dummies'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col= beam_df.filter(items=['object_class'])\n",
+        "object_class_col.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "hz8s7z8caTq-",
+        "outputId": "5c543a6b-0ea1-41f8-afec-9691bbbd1f5b"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1927971370.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;31m# df['categories_concat'].str.get_dummies('-')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'str'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col.str.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "rVdSIyCB0spw"
+      },
+      "source": [
+        "# Putting it all together\n",
+        "\n",
+        "Let's now combine all the steps that we've executed above into a single pipeline and visualize our pre-processed data.\n",
+        "\n",
+        "> ℹ️ Note that the only standard Beam construct used here is the `Pipeline` instance. All other pre-processing commands are native pandas methods that have been integrated into the Beam DataFrames API."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "ndaSNond0v8Q",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 498
+        },
+        "outputId": "0155d359-45c9-4345-e1b6-b1881408f049"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "error",
+          "ename": "ValueError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1408061827.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m     25\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     26\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 27\u001b[0;31m \u001b[0mib\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcollect\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpreprocessed_dataset\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/utils.py\u001b[0m in \u001b[0;36mrun_within_progress_indicator\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    275\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mrun_within_progress_indicator\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    276\u001b[0m     \u001b[0;32mwith\u001b[0m \u001b[0mProgressIndicator\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34mf'Processing... {func.__name__}'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Done.'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 277\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001
 b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    278\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    279\u001b[0m   \u001b[0;32mreturn\u001b[0m \u001b[0mrun_within_progress_indicator\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/interactive_beam.py\u001b[0m in \u001b[0;36mcollect\u001b[0;34m(pcoll, n, duration, include_window_info)\u001b[0m\n\u001b[1;32m    945\u001b[0m         element_type=element_type)\n\u001b[1;32m    946\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 947\u001b[0;31m   \u001b[0mrecording\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mrecording_manager\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrecord\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mpcoll\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmax_n\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mn\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmax_duration\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mduration\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    948\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    949\u001b[0m   \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0
 ;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/recording_manager.py\u001b[0m in \u001b[0;36mrecord\u001b[0;34m(self, pcolls, max_n, max_duration)\u001b[0m\n\u001b[1;32m    459\u001b[0m       pf.PipelineFragment(\n\u001b[1;32m    460\u001b[0m           \u001b[0mlist\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0muncomputed_pcolls\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 461\u001b[0;31m           self.user_pipeline.options).run(blocking=is_remote_run)\n\u001b[0m\u001b[1;32m    462\u001b[0m       \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mie\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcurrent_env\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpipeline_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0muser_pipeline\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n
 \u001b[1;32m    463\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py\u001b[0m in \u001b[0;36mrun\u001b[0;34m(self, display_pipeline_graph, use_cache, blocking)\u001b[0m\n\u001b[1;32m    111\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_force_compute\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0muse_cache\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    112\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_blocking\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mblocking\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 113\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdeduce_fragment\u001
 b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    114\u001b[0m     \u001b[0;32mfinally\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    115\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_skip_display\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpreserved_skip_display\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py\u001b[0m in \u001b[0;36mdeduce_fragment\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m     98\u001b[0m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mto_runner_api\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     99\u001b[0m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 100\u001b[0;31m         self._options)\n\u001b[0m\u001b[1;32m    101\u001b[0m     \u001b[0mie\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcurrent_env\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0madd_derived_pipeline\u001b[0m\u001b[0;34m(\u001b[0m\u001b[
 0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfragment\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    102\u001b[0m     \u001b[0;32mreturn\u001b[0m \u001b[0mfragment\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/pipeline.py\u001b[0m in \u001b[0;36mfrom_runner_api\u001b[0;34m(proto, runner, options, return_context)\u001b[0m\n\u001b[1;32m    990\u001b[0m       \u001b[0mpcollection\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpipeline\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mp\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    991\u001b[0m       \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mpcollection\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mproducer\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 992\u001b[0;31m         \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'No producer for %s'\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0mid\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    993\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    994\u001b[0m 
     \u001b[0;31m# Inject PBegin input where necessary.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mValueError\u001b[0m: No producer for ref_PCollection_PCollection_265"
+          ]
+        }
+      ],
+      "source": [
+        "# Specify the location of the source CSV file to be processed\n",
+        "source_csv_file = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize the pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(source_csv_file, splittable=True)\n",
+        "\n",
+        "# Drop irrelevant columns/columns with missing values\n",
+        "beam_df = beam_df.drop(['spk_id', 'full_name', 'diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "\n",
+        "# Get numerical columns/columns with categorical variables\n",
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))\n",
+        "\n",
+        "# Normalize the numerical variables \n",
+        "beam_df_numerical = beam_df.filter(items=numerical_cols)\n",
+        "beam_df_numerical = (beam_df_numerical - beam_df_numerical.mean())/beam_df_numerical.std()\n",
+        "\n",
+        "\n",
+        "# One-hot encode the categorical variables \n",
+        "beam_df_categorical = beam_df.filter(items=categorical_cols)\n",
+        "# TODO: one-hot encoding step\n",
+        "\n",
+        "# Merge the normalized variables with the one-hot encoded variables\n",
+        "preprocessed_dataset = beam_df_categorical.merge(beam_df_numerical, left_index=True, right_index=True)\n",
+        "\n",
+        "ib.collect(preprocessed_dataset)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xZvJTqa3XKI_"
+      },
+      "source": [
+        "# Part II : Process the full dataset with the Distributed Runner\n",
+        "Now that we've showcased how to build and execute the pipeline locally using the Interactive Runner, it's time to execute our pipeline on the full dataset by switching to a distributed runner. For this example, we will execute our pipeline on [Dataflow](https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline)."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "PROJECT_ID = \"<my-gcp-project>\"\n",
+        "REGION = \"us-west1\"\n",

Review Comment:
   easy fix!





[GitHub] [beam] davidcavazos commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
davidcavazos commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r1013192618


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,3517 @@
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "source": [
+        "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License."
+      ],
+      "metadata": {
+        "id": "sARMhsXz8yR1",
+        "cellView": "form"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like\n",
+        "API to declare and define Beam processing pipelines. It provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by invoking only standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Goal\n",
+        "The goal of this notebook is to explore a dataset and preprocess it for machine learning model training using the Beam DataFrames API.\n",
+        "\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "As we want to explore the elements within a `PCollection`, we can make use of the Interactive runner by installing Apache Beam with the `interactive` component. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [

Review Comment:
   I don't think JSON accepts comments, but it could be an `<!-- HTML comment -->` in a Markdown cell.
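
A minimal sketch of the point above, assuming a hypothetical TODO note and a made-up cell body (neither is taken from the notebook): a `//` comment inside the raw `.ipynb` JSON makes the file unparseable, while the same note carried as an HTML comment in a Markdown cell's `source` survives a JSON round trip.

```python
import json

# Hypothetical notebook fragment with a JavaScript-style comment.
# Comments are not part of the JSON grammar, so this should fail to parse.
with_comment = '{"source": [ // TODO: remove later\n "pip install apache-beam" ]}'


def parses(text: str) -> bool:
    """Return True if `text` is valid JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# The same note as an HTML comment inside a Markdown cell's source.
# The cell content here is illustrative only.
markdown_cell = {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
        "<!-- TODO: remove later -->\n",
        "**Option 1:** Install latest version",
    ],
}

# Serialize and parse again to confirm the cell is still valid JSON.
round_tripped = json.loads(json.dumps(markdown_cell))

print(parses(with_comment))
print(round_tripped == markdown_cell)
```

Markdown renderers generally hide HTML comments, so the note would stay in the file without showing up in the rendered notebook.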





[GitHub] [beam] PhilippeMoussalli commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
PhilippeMoussalli commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r948163222


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. It provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by invoking only standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to quickly test our pipeline locally before running it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we will only be working with a subset of the original dataset since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam function returns a deferred Beam DataFrame, whereas pandas returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, sample] where\n",
+        "# sample contains all of the dataset (around 1000000 samples)\n",
+        "file_location = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(file_location, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. A distance of 70,000 km might not seem potentially harmful, but at an astronomical scale it is very small and can disrupt many natural phenomena. \n",
+        "\n",
+        "These objects/asteroids can thus prove to be harmful. Hence, it is wise to know what surrounds us and which of those objects can harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types"
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "c24ff83d-3a13-47a6-c9c2-3978729fde82"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 4
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a Pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 746
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "14fa80de-2dee-4963-99d8-3e321f949ff8"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a986c6cc61ed5a5b622e163d92f73775\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "       spk_id                   full_name near_earth_object  \\\n",
+              "0     2000001                     1 Ceres                 N   \n",
+              "1     2000002                    2 Pallas                 N   \n",
+              "2     2000003                      3 Juno                 N   \n",
+              "3     2000004                     4 Vesta                 N   \n",
+              "4     2000005                   5 Astraea                 N   \n",
+              "...       ...                         ...               ...   \n",
+              "9994  2009995    9995 Alouette (4805 P-L)                 N   \n",
+              "9995  2009996         9996 ANS (9070 P-L)                 N   \n",
+              "9996  2009997        9997 COBE (1217 T-1)                 N   \n",
+              "9997  2009998         9998 ISO (1293 T-1)                 N   \n",
+              "9998  2009999       9999 Wiles (4196 T-2)                 N   \n",
+              "\n",
+              "      absolute_magnitude  diameter  albedo  diameter_sigma  eccentricity  \\\n",
+              "0                   3.40   939.400  0.0900           0.200      0.076009   \n",
+              "1                   4.20   545.000  0.1010          18.000      0.229972   \n",
+              "2                   5.33   246.596  0.2140          10.594      0.256936   \n",
+              "3                   3.00   525.400  0.4228           0.200      0.088721   \n",
+              "4                   6.90   106.699  0.2740           3.140      0.190913   \n",
+              "...                  ...       ...     ...             ...           ...   \n",
+              "9994               15.10     2.564  0.2450           0.550      0.160610   \n",
+              "9995               13.60     8.978  0.1130           0.376      0.235174   \n",
+              "9996               14.30       NaN     NaN             NaN      0.113059   \n",
+              "9997               15.10     2.235  0.3880           0.373      0.093852   \n",
+              "9998               13.00     7.148  0.2620           0.065      0.071351   \n",
+              "\n",
+              "      inclination     moid_ld object_class  semi_major_axis_au_unit  \\\n",
+              "0       10.594067  620.640533          MBA                 2.769165   \n",
+              "1       34.832932  480.348639          MBA                 2.773841   \n",
+              "2       12.991043  402.514639          MBA                 2.668285   \n",
+              "3        7.141771  443.451432          MBA                 2.361418   \n",
+              "4        5.367427  426.433027          MBA                 2.574037   \n",
+              "...           ...         ...          ...                      ...   \n",
+              "9994     2.311731  388.723233          MBA                 2.390249   \n",
+              "9995     7.657713  444.194746          MBA                 2.796605   \n",
+              "9996     2.459643  495.460110          MBA                 2.545674   \n",
+              "9997     3.912263  373.848377          MBA                 2.160961   \n",
+              "9998     3.198839  632.144398          MBA                 2.839917   \n",
+              "\n",
+              "     hazardous_flag  \n",
+              "0                 N  \n",
+              "1                 N  \n",
+              "2                 N  \n",
+              "3                 N  \n",
+              "4                 N  \n",
+              "...             ...  \n",
+              "9994              N  \n",
+              "9995              N  \n",
+              "9996              N  \n",
+              "9997              N  \n",
+              "9998              N  \n",
+              "\n",
+              "[9999 rows x 13 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-0161aa69-d50f-4d6f-84c1-10dacb278880\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>full_name</th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>2000001</td>\n",
+              "      <td>1 Ceres</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>939.400</td>\n",
+              "      <td>0.0900</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>2000002</td>\n",
+              "      <td>2 Pallas</td>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>545.000</td>\n",
+              "      <td>0.1010</td>\n",
+              "      <td>18.000</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>2000003</td>\n",
+              "      <td>3 Juno</td>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>246.596</td>\n",
+              "      <td>0.2140</td>\n",
+              "      <td>10.594</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>2000004</td>\n",
+              "      <td>4 Vesta</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>525.400</td>\n",
+              "      <td>0.4228</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>2000005</td>\n",
+              "      <td>5 Astraea</td>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>106.699</td>\n",
+              "      <td>0.2740</td>\n",
+              "      <td>3.140</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>2009995</td>\n",
+              "      <td>9995 Alouette (4805 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.564</td>\n",
+              "      <td>0.2450</td>\n",
+              "      <td>0.550</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>2009996</td>\n",
+              "      <td>9996 ANS (9070 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>8.978</td>\n",
+              "      <td>0.1130</td>\n",
+              "      <td>0.376</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>2009997</td>\n",
+              "      <td>9997 COBE (1217 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>2009998</td>\n",
+              "      <td>9998 ISO (1293 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.235</td>\n",
+              "      <td>0.3880</td>\n",
+              "      <td>0.373</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>2009999</td>\n",
+              "      <td>9999 Wiles (4196 T-2)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>7.148</td>\n",
+              "      <td>0.2620</td>\n",
+              "      <td>0.065</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 13 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-0161aa69-d50f-4d6f-84c1-10dacb278880')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 5
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check whether all the columns can be used for model training. Let's first have a look at the column descriptions as provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the magnitude an asteroid would have if it were located 1 au from both the Sun and the observer, at zero phase angle.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km\n",
+        "* **albedo:** the fraction of incoming solar radiation that the object reflects diffusely, measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in the object diameter, in km.\n",
+        "* **eccentricity:** value between 0 and 1 that describes how far the asteroid's orbit deviates from a perfect circle (0 is circular)\n",
+        "* **inclination:** angle of the orbit with respect to the x-y ecliptic plane\n",
+        "* **moid_ld:** Earth Minimum Orbit Intersection Distance, in lunar distances (LD)\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** the length of half of the long axis of the orbit, in au\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. These columns can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's have a look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_02cba1067de8e024374a26297834d233\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most columns have no missing values. Columns **'diameter'**, **'albedo'** and **'diameter_sigma'** each have roughly 13-14% missing values. Since these values cannot be measured or derived, we remove these columns and do not use them for machine learning model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "The numerical columns need to be normalized before using them to train a model. A common normalization method is standardization: subtract the mean and divide by the standard deviation. This puts all the columns on the same scale so that they are weighted equally during training."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical columns and the categorical columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 356
+        },
+        "id": "Gjc0UlDD-xUn",
+        "outputId": "cadc4402-7edc-43f6-bc6a-bc4f1fd3314a"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "error",
+          "ename": "NotImplementedError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mNotImplementedError\u001b[0m                       Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/987840581.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Normalizing method_1: Can work but relies on ticket\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frame_base.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    426\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    427\u001b[0m     raise NotImplementedError(\n\u001b[0;32m--> 428\u001b[0;31m         \u001b[0;34mf\"{op!r} is not implemented yet. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    429\u001b[0m         \u001b[0;34mf\"If support for {op!r} is important to you, please let the Beam \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    430\u001b[0m         \u001b[0;34m\"community know by writing to user@beam.apache.org \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mNotImplementedError\u001b[0m: 'loc.setitem' is not implemented yet. If support for 'loc.setitem' is important to you, please let the Beam community know by writing to user@beam.apache.org (see https://beam.apache.org/community/contact-us/) or commenting on https://github.com/apache/beam/issues/20318"
+          ]
+        }
+      ],
+      "source": [
+        "# Normalizing method_1: Can work but relies on ticket #22267\n",
+        "beam_df.loc[:, numerical_cols] = (beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean()) / beam_df.loc[:, numerical_cols].std()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
+        "Normalizing the data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "be40308f-d27c-46fa-e365-43a635addf6b"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>306</th>\n",
+              "      <td>-1.570727</td>\n",
+              "      <td>-0.062543</td>\n",
+              "      <td>-0.278518</td>\n",
+              "      <td>0.373194</td>\n",
+              "      <td>0.357201</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>310</th>\n",
+              "      <td>-1.631718</td>\n",
+              "      <td>-1.724526</td>\n",
+              "      <td>-0.736389</td>\n",
+              "      <td>1.087833</td>\n",
+              "      <td>0.344233</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>546</th>\n",
+              "      <td>-1.753698</td>\n",
+              "      <td>1.028793</td>\n",
+              "      <td>1.415303</td>\n",
+              "      <td>-0.339489</td>\n",
+              "      <td>0.139080</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>635</th>\n",
+              "      <td>-1.875678</td>\n",
+              "      <td>0.244869</td>\n",
+              "      <td>0.005905</td>\n",
+              "      <td>0.214107</td>\n",
+              "      <td>0.367559</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>701</th>\n",
+              "      <td>-3.278451</td>\n",
+              "      <td>-1.570523</td>\n",
+              "      <td>2.006145</td>\n",
+              "      <td>1.542754</td>\n",
+              "      <td>0.829337</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9697</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-1.151809</td>\n",
+              "      <td>-0.082944</td>\n",
+              "      <td>-0.129556</td>\n",
+              "      <td>-0.533538</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9813</th>\n",
+              "      <td>1.722740</td>\n",
+              "      <td>0.844551</td>\n",
+              "      <td>-0.583247</td>\n",
+              "      <td>-1.006447</td>\n",
+              "      <td>-0.677961</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9868</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-0.207399</td>\n",
+              "      <td>-0.784665</td>\n",
+              "      <td>-0.462136</td>\n",
+              "      <td>-0.539794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9903</th>\n",
+              "      <td>0.868878</td>\n",
+              "      <td>0.460086</td>\n",
+              "      <td>0.092258</td>\n",
+              "      <td>-0.107597</td>\n",
+              "      <td>0.071794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9956</th>\n",
+              "      <td>0.746898</td>\n",
+              "      <td>-0.234132</td>\n",
+              "      <td>-0.161116</td>\n",
+              "      <td>-0.601379</td>\n",
+              "      <td>-0.664887</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 5 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_numerical_cols = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize the DataFrame that contains only the numerical columns\n",
+        "beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean())/beam_numerical_cols.std()\n",
+        "\n",
+        "ib.collect(beam_numerical_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to convert the categorical columns into one-hot encoded variables so that we can use them during training.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "Ngoxg0rSywVd",
+        "outputId": "d81bb29a-f8f8-4186-b186-5ff85667dbec"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1671644751.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'object_class'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'get_dummies'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col= beam_df.filter(items=['object_class'])\n",
+        "object_class_col.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "hz8s7z8caTq-",
+        "outputId": "5c543a6b-0ea1-41f8-afec-9691bbbd1f5b"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1927971370.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;31m# df['categories_concat'].str.get_dummies('-')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'str'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col.str.get_dummies()"

Review Comment:
   I tried this approach and I am still not able to run the `unique()` command, even when calling `unique(as_series=True)` as specified [here](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.frames.html#:~:text=unique(as_series%3DFalse,%5Bsource%5D). 
   (Error: `unique()` is not implemented for deferred Dataframes).
   In any case, I think it would be better to wait for `get_dummies()`, as it will be more intuitive for users. I see that you already filed a [ticket](https://github.com/apache/beam/issues/22646) for it and referred to it in the deliverable 3 doc. 
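
For context, here is a minimal plain-Python sketch (not the Beam API) of what one-hot encoding involves. The output columns are determined by the distinct category values, which is presumably why the operation is awkward on a deferred DataFrame whose data has not been read yet. The helper `one_hot` and the category names are hypothetical and only illustrate the idea; the real values of `object_class` may differ.

```python
def one_hot(values, categories):
    """One-hot encode `values` against a fixed, pre-declared category list."""
    return [
        {f"object_class_{c}": int(v == c) for c in categories}
        for v in values
    ]


# Illustrative category names (an assumption, not taken from the dataset).
categories = ["AMO", "APO", "MBA"]

# Each input value becomes one row with a 0/1 entry per known category.
rows = one_hot(["APO", "MBA", "APO"], categories)
for row in rows:
    print(row)
```

Declaring the categories up front fixes the set of output columns before any data is processed; values outside the list simply produce an all-zero row.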
   





[GitHub] [beam] KevinGG commented on pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
KevinGG commented on PR #22587:
URL: https://github.com/apache/beam/pull/22587#issuecomment-1218237302

   > > > > I think the "No producer" error is a bug with PCollection pruning in interactive beam: #21430
   > > > > CC: @KevinGG
   > > > 
   > > > 
   > > > Commented in #21430, we can disable pruning for dataframe like what we did for TestStream: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py#L219
   > > 
   > > 
   > > Note if we have to do that to unblock this change, it will be blocked until 2.42.0 is out.
   > 
   > Would it be easier to execute the work-around with ‘loc.setitem’? #22267
   
   The work-around is applied to a specific typed composite transform. So the difficulty is the same.




[GitHub] [beam] PhilippeMoussalli commented on pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
PhilippeMoussalli commented on PR #22587:
URL: https://github.com/apache/beam/pull/22587#issuecomment-1302043846

   Implemented latest feedback @TheNeuralBit @davidcavazos :)




[GitHub] [beam] PhilippeMoussalli commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
PhilippeMoussalli commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r964644584


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. They provide a familiar interface for machine learning practitioners to build complex data-processing pipelines by only invoking standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to quickly test our pipeline locally before running it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we will only be working with a subset of the original dataset since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam function returns a deferred Beam DataFrame, while pandas returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, sample] where\n",
+        "# sample contains all of the dataset (around 1000000 samples)\n",
+        "file_location = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(file_location, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "Outer space contains countless objects, and some of them are closer than we think. A distance of 70,000 km may seem too large to pose any danger, but on an astronomical scale it is very small, and an object that close can disrupt many natural phenomena.\n",
+        "\n",
+        "These objects/asteroids can thus prove to be harmful, so it is wise to know what surrounds us and which of these objects could harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest-Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types"
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "c24ff83d-3a13-47a6-c9c2-3978729fde82"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 4
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a Pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 746
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "14fa80de-2dee-4963-99d8-3e321f949ff8"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a986c6cc61ed5a5b622e163d92f73775\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "       spk_id                   full_name near_earth_object  \\\n",
+              "0     2000001                     1 Ceres                 N   \n",
+              "1     2000002                    2 Pallas                 N   \n",
+              "2     2000003                      3 Juno                 N   \n",
+              "3     2000004                     4 Vesta                 N   \n",
+              "4     2000005                   5 Astraea                 N   \n",
+              "...       ...                         ...               ...   \n",
+              "9994  2009995    9995 Alouette (4805 P-L)                 N   \n",
+              "9995  2009996         9996 ANS (9070 P-L)                 N   \n",
+              "9996  2009997        9997 COBE (1217 T-1)                 N   \n",
+              "9997  2009998         9998 ISO (1293 T-1)                 N   \n",
+              "9998  2009999       9999 Wiles (4196 T-2)                 N   \n",
+              "\n",
+              "      absolute_magnitude  diameter  albedo  diameter_sigma  eccentricity  \\\n",
+              "0                   3.40   939.400  0.0900           0.200      0.076009   \n",
+              "1                   4.20   545.000  0.1010          18.000      0.229972   \n",
+              "2                   5.33   246.596  0.2140          10.594      0.256936   \n",
+              "3                   3.00   525.400  0.4228           0.200      0.088721   \n",
+              "4                   6.90   106.699  0.2740           3.140      0.190913   \n",
+              "...                  ...       ...     ...             ...           ...   \n",
+              "9994               15.10     2.564  0.2450           0.550      0.160610   \n",
+              "9995               13.60     8.978  0.1130           0.376      0.235174   \n",
+              "9996               14.30       NaN     NaN             NaN      0.113059   \n",
+              "9997               15.10     2.235  0.3880           0.373      0.093852   \n",
+              "9998               13.00     7.148  0.2620           0.065      0.071351   \n",
+              "\n",
+              "      inclination     moid_ld object_class  semi_major_axis_au_unit  \\\n",
+              "0       10.594067  620.640533          MBA                 2.769165   \n",
+              "1       34.832932  480.348639          MBA                 2.773841   \n",
+              "2       12.991043  402.514639          MBA                 2.668285   \n",
+              "3        7.141771  443.451432          MBA                 2.361418   \n",
+              "4        5.367427  426.433027          MBA                 2.574037   \n",
+              "...           ...         ...          ...                      ...   \n",
+              "9994     2.311731  388.723233          MBA                 2.390249   \n",
+              "9995     7.657713  444.194746          MBA                 2.796605   \n",
+              "9996     2.459643  495.460110          MBA                 2.545674   \n",
+              "9997     3.912263  373.848377          MBA                 2.160961   \n",
+              "9998     3.198839  632.144398          MBA                 2.839917   \n",
+              "\n",
+              "     hazardous_flag  \n",
+              "0                 N  \n",
+              "1                 N  \n",
+              "2                 N  \n",
+              "3                 N  \n",
+              "4                 N  \n",
+              "...             ...  \n",
+              "9994              N  \n",
+              "9995              N  \n",
+              "9996              N  \n",
+              "9997              N  \n",
+              "9998              N  \n",
+              "\n",
+              "[9999 rows x 13 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-0161aa69-d50f-4d6f-84c1-10dacb278880\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>full_name</th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>2000001</td>\n",
+              "      <td>1 Ceres</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>939.400</td>\n",
+              "      <td>0.0900</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>2000002</td>\n",
+              "      <td>2 Pallas</td>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>545.000</td>\n",
+              "      <td>0.1010</td>\n",
+              "      <td>18.000</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>2000003</td>\n",
+              "      <td>3 Juno</td>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>246.596</td>\n",
+              "      <td>0.2140</td>\n",
+              "      <td>10.594</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>2000004</td>\n",
+              "      <td>4 Vesta</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>525.400</td>\n",
+              "      <td>0.4228</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>2000005</td>\n",
+              "      <td>5 Astraea</td>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>106.699</td>\n",
+              "      <td>0.2740</td>\n",
+              "      <td>3.140</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>2009995</td>\n",
+              "      <td>9995 Alouette (4805 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.564</td>\n",
+              "      <td>0.2450</td>\n",
+              "      <td>0.550</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>2009996</td>\n",
+              "      <td>9996 ANS (9070 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>8.978</td>\n",
+              "      <td>0.1130</td>\n",
+              "      <td>0.376</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>2009997</td>\n",
+              "      <td>9997 COBE (1217 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>2009998</td>\n",
+              "      <td>9998 ISO (1293 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.235</td>\n",
+              "      <td>0.3880</td>\n",
+              "      <td>0.373</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>2009999</td>\n",
+              "      <td>9999 Wiles (4196 T-2)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>7.148</td>\n",
+              "      <td>0.2620</td>\n",
+              "      <td>0.065</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 13 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-0161aa69-d50f-4d6f-84c1-10dacb278880')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 5
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
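A minimal sketch of the two transformations named in the cell above, written in plain pandas on a hypothetical three-row frame rather than on the notebook's Beam DataFrame. The toy values, the choice of z-score scaling as the normalization, and the use of `pd.get_dummies` for the one-hot encoding are assumptions for illustration; whether the same calls work on a deferred Beam DataFrame is not shown here.

```python
import pandas as pd

# Hypothetical toy frame that mimics two of the notebook's columns.
df = pd.DataFrame({
    "eccentricity": [0.1, 0.2, 0.3],
    "object_class": ["MBA", "APO", "MBA"],
})

numeric_cols = ["eccentricity"]
categorical_cols = ["object_class"]

# Z-score normalization: subtract the column mean, divide by the column std.
normalized = (df[numeric_cols] - df[numeric_cols].mean()) / df[numeric_cols].std()

# One-hot encoding: one indicator column per distinct category value.
encoded = pd.get_dummies(df[categorical_cols])

# Recombine into a single, fully numerical feature table.
preprocessed = pd.concat([normalized, encoded], axis="columns")
print(preprocessed)
```

The result has one scaled column plus one indicator column per category, which is the shape most training libraries expect.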
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check whether all the columns can be used for model training. Let's first look at the column descriptions provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the apparent magnitude an object would have if it were located at a distance of 10 parsecs.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km\n",
+        "* **albedo:** a measure of the diffuse reflection of solar radiation out of the total solar radiation, measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in object diameter, in km\n",
+        "* **eccentricity:** value between 0 and 1 that refers to how much the asteroid's orbit deviates from a perfect circle\n",
+        "* **inclination:** angle with respect to x-y ecliptic plane\n",
+        "* **moid_ld:** Earth Minimum Orbit Intersection Distance (MOID), in lunar distances (LD)\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** half the length of the long axis of the orbit, in astronomical units (au)\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. These columns can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_02cba1067de8e024374a26297834d233\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most columns have no missing values. Columns **'diameter'**, **'albedo'** and **'diameter_sigma'** each have more than 13% missing values. Since these values cannot be measured or derived, we can remove these columns; they are not required for training the machine learning model."
+      ]
+    },
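The reasoning in the cell above can be sketched in plain pandas as follows. This is an illustration under stated assumptions, not the notebook's own code: the four-row frame and the `MAX_MISSING_PCT` cutoff are hypothetical, and the sketch runs on an in-memory pandas DataFrame rather than a deferred Beam DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; NaN marks a missing measurement.
df = pd.DataFrame({
    "diameter": [939.4, np.nan, 246.596, np.nan],
    "albedo": [0.09, 0.101, np.nan, 0.4228],
    "eccentricity": [0.076, 0.230, 0.257, 0.089],
})

# isnull() yields booleans, so the column mean is the fraction of missing rows.
missing_pct = df.isnull().mean() * 100

# Hypothetical cutoff: drop every column with more than 10% missing values.
MAX_MISSING_PCT = 10.0
to_drop = missing_pct[missing_pct > MAX_MISSING_PCT].index
kept = df.drop(columns=to_drop)

print(missing_pct)
print(list(kept.columns))
```

Choosing the columns from the computed percentages instead of hard-coding their names keeps the step valid if the input data changes, at the cost of having to pick a cutoff.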
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "The numerical columns need to be normalized before using them to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation. This puts all numerical columns on a comparable scale so that they are weighted equally during training.  "
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical columns and the categorical columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 356
+        },
+        "id": "Gjc0UlDD-xUn",
+        "outputId": "cadc4402-7edc-43f6-bc6a-bc4f1fd3314a"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "error",
+          "ename": "NotImplementedError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mNotImplementedError\u001b[0m                       Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/987840581.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Normalizing method_1: Can work but relies on ticket\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0;34m\u001b[0;34m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frame_base.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    426\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    427\u001b[0m     raise NotImplementedError(\n\u001b[0;32m--> 428\u001b[0;31m         \u001b[0;34mf\"{op!r} is not implemented yet. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    429\u001b[0m         \u001b[0;34mf\"If support for {op!r} is important to you, please let the Beam \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    430\u001b[0m         \u001b[0;34m\"community know by writing to user@beam.apache.org \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\n",
+            "\u001b[0;31mNotImplementedError\u001b[0m: 'loc.setitem' is not implemented yet. If support for 'loc.setitem' is important to you, please let the Beam community know by writing to user@beam.apache.org (see https://beam.apache.org/community/contact-us/) or commenting on https://github.com/apache/beam/issues/20318"
+          ]
+        }
+      ],
+      "source": [
+        "# Normalizing method_1: Can work but relies on ticket #22267; it also needs 'loc.setitem', which is not implemented yet (issue #20318)\n",
+        "beam_df.loc[:, numerical_cols] = (beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean()) / beam_df.loc[:, numerical_cols].std()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
+        "Normalizing the data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "be40308f-d27c-46fa-e365-43a635addf6b"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>306</th>\n",
+              "      <td>-1.570727</td>\n",
+              "      <td>-0.062543</td>\n",
+              "      <td>-0.278518</td>\n",
+              "      <td>0.373194</td>\n",
+              "      <td>0.357201</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>310</th>\n",
+              "      <td>-1.631718</td>\n",
+              "      <td>-1.724526</td>\n",
+              "      <td>-0.736389</td>\n",
+              "      <td>1.087833</td>\n",
+              "      <td>0.344233</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>546</th>\n",
+              "      <td>-1.753698</td>\n",
+              "      <td>1.028793</td>\n",
+              "      <td>1.415303</td>\n",
+              "      <td>-0.339489</td>\n",
+              "      <td>0.139080</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>635</th>\n",
+              "      <td>-1.875678</td>\n",
+              "      <td>0.244869</td>\n",
+              "      <td>0.005905</td>\n",
+              "      <td>0.214107</td>\n",
+              "      <td>0.367559</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>701</th>\n",
+              "      <td>-3.278451</td>\n",
+              "      <td>-1.570523</td>\n",
+              "      <td>2.006145</td>\n",
+              "      <td>1.542754</td>\n",
+              "      <td>0.829337</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9697</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-1.151809</td>\n",
+              "      <td>-0.082944</td>\n",
+              "      <td>-0.129556</td>\n",
+              "      <td>-0.533538</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9813</th>\n",
+              "      <td>1.722740</td>\n",
+              "      <td>0.844551</td>\n",
+              "      <td>-0.583247</td>\n",
+              "      <td>-1.006447</td>\n",
+              "      <td>-0.677961</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9868</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-0.207399</td>\n",
+              "      <td>-0.784665</td>\n",
+              "      <td>-0.462136</td>\n",
+              "      <td>-0.539794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9903</th>\n",
+              "      <td>0.868878</td>\n",
+              "      <td>0.460086</td>\n",
+              "      <td>0.092258</td>\n",
+              "      <td>-0.107597</td>\n",
+              "      <td>0.071794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9956</th>\n",
+              "      <td>0.746898</td>\n",
+              "      <td>-0.234132</td>\n",
+              "      <td>-0.161116</td>\n",
+              "      <td>-0.601379</td>\n",
+              "      <td>-0.664887</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 5 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_numerical_cols = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize the dataframe that contains only the numerical columns\n",
+        "beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean())/beam_numerical_cols.std()\n",
+        "\n",
+        "ib.collect(beam_numerical_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to convert the categorical columns into one-hot encoded variables so that they can be used during training.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "Ngoxg0rSywVd",
+        "outputId": "d81bb29a-f8f8-4186-b186-5ff85667dbec"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1671644751.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'object_class'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'get_dummies'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col= beam_df.filter(items=['object_class'])\n",
+        "object_class_col.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "hz8s7z8caTq-",
+        "outputId": "5c543a6b-0ea1-41f8-afec-9691bbbd1f5b"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1927971370.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;31m# df['categories_concat'].str.get_dummies('-')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'str'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col.str.get_dummies()"

Review Comment:
   @TheNeuralBit indeed, you're correct: `unique()` can only be applied to a `Series`. I was able to get the unique classes of the categorical column and transform them into a `CategoricalDtype`:
   
   ```
   object_class_col= beam_df['object_class']
   
   unique_classes = pd.CategoricalDtype(ib.collect(object_class_col.unique(as_series=True)))
   ```
   This returns:
   
   ```
   CategoricalDtype(categories=['MBA', 'OMB', 'MCA', 'AMO', 'IMB', 'TJN', 'CEN', 'APO',
                     'ATE', 'AST'], ordered=False)
   ```
   
   I tried implementing the workaround you suggested for `str.get_dummies()` but still ran into some issues:
   
   ```
   object_class_col.astype(unique_classes).str.get_dummies()
   
   WontImplementError: astype(dtype='category') is not supported because the type of the output column depends on the data. Please use pd.CategoricalDtype with explicit categories instead.
   For more information see https://s.apache.org/dataframe-non-deferred-columns.
   ```
   I ran into the same issue when trying to convert `object_class_col` to a categorical variable:
   
   `object_class_col.astype('category')`
   
   Am I still missing something? 
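
For comparison, here is a minimal plain-pandas sketch of the "explicit categories" idea that the error message points to. It is not Beam code and does not claim to resolve the issue above. The sample values and the names `object_class`, `class_dtype`, and `one_hot` are assumptions made up for illustration. The point it shows is that once the categories are declared up front, the output columns are known without looking at the data.

```python
import pandas as pd

# Hypothetical stand-in for a small sample of the `object_class` column.
object_class = pd.Series(["MBA", "OMB", "MBA", "AMO"], name="object_class")

# Declare the categories explicitly, so the set of output columns
# does not depend on which values happen to appear in the data.
class_dtype = pd.CategoricalDtype(categories=["MBA", "OMB", "MCA", "AMO"])

# One-hot encode: one column per declared category, including "MCA",
# which never occurs in this sample.
one_hot = pd.get_dummies(object_class.astype(class_dtype))
print(one_hot)
```

In plain pandas, `astype('category')` would instead infer the categories from the data, which is the data-dependent behavior that the `WontImplementError` message describes.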
   
   





[GitHub] [beam] TheNeuralBit commented on pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on PR #22587:
URL: https://github.com/apache/beam/pull/22587#issuecomment-1209999846

   > > I think the "No producer" error is a bug with PCollection pruning in interactive beam: #21430
   > > CC: @KevinGG
   > 
   > Commented in #21430, we can disable pruning for dataframe like what we did for TestStream: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py#L219
   
   Note that if we have to do that to unblock this change, it will be blocked until 2.42.0 is out.




[GitHub] [beam] rezarokni commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
rezarokni commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r959782275


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. It provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by only invoking standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove this text later"

Review Comment:
   Remove.



##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. It provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by only invoking standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"

Review Comment:
   Describe that as we want to explore the elements within a PCollection we can make use of the ...



##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. It provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by only invoking standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use the [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to quickly test our pipeline locally before running it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we will only be working with a subset of the original dataset since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam method returns a deferred Beam DataFrame, while pandas returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, sample] where\n",
+        "# sample contains all of the dataset (around 1000000 samples)\n",
+        "file_location = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(file_location, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. Although a distance of 70,000 km might seem too large to harm us, at an astronomical scale this is a very small distance and can disrupt many natural phenomena. \n",
+        "\n",
+        "These objects can therefore prove harmful, so it is wise to know what surrounds us and which of those objects can harm us. This dataset compiles the list of NASA-certified asteroids that are classified as the nearest Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types"
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "c24ff83d-3a13-47a6-c9c2-3978729fde82"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 4
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a Pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 746
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "14fa80de-2dee-4963-99d8-3e321f949ff8"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a986c6cc61ed5a5b622e163d92f73775\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "       spk_id                   full_name near_earth_object  \\\n",
+              "0     2000001                     1 Ceres                 N   \n",
+              "1     2000002                    2 Pallas                 N   \n",
+              "2     2000003                      3 Juno                 N   \n",
+              "3     2000004                     4 Vesta                 N   \n",
+              "4     2000005                   5 Astraea                 N   \n",
+              "...       ...                         ...               ...   \n",
+              "9994  2009995    9995 Alouette (4805 P-L)                 N   \n",
+              "9995  2009996         9996 ANS (9070 P-L)                 N   \n",
+              "9996  2009997        9997 COBE (1217 T-1)                 N   \n",
+              "9997  2009998         9998 ISO (1293 T-1)                 N   \n",
+              "9998  2009999       9999 Wiles (4196 T-2)                 N   \n",
+              "\n",
+              "      absolute_magnitude  diameter  albedo  diameter_sigma  eccentricity  \\\n",
+              "0                   3.40   939.400  0.0900           0.200      0.076009   \n",
+              "1                   4.20   545.000  0.1010          18.000      0.229972   \n",
+              "2                   5.33   246.596  0.2140          10.594      0.256936   \n",
+              "3                   3.00   525.400  0.4228           0.200      0.088721   \n",
+              "4                   6.90   106.699  0.2740           3.140      0.190913   \n",
+              "...                  ...       ...     ...             ...           ...   \n",
+              "9994               15.10     2.564  0.2450           0.550      0.160610   \n",
+              "9995               13.60     8.978  0.1130           0.376      0.235174   \n",
+              "9996               14.30       NaN     NaN             NaN      0.113059   \n",
+              "9997               15.10     2.235  0.3880           0.373      0.093852   \n",
+              "9998               13.00     7.148  0.2620           0.065      0.071351   \n",
+              "\n",
+              "      inclination     moid_ld object_class  semi_major_axis_au_unit  \\\n",
+              "0       10.594067  620.640533          MBA                 2.769165   \n",
+              "1       34.832932  480.348639          MBA                 2.773841   \n",
+              "2       12.991043  402.514639          MBA                 2.668285   \n",
+              "3        7.141771  443.451432          MBA                 2.361418   \n",
+              "4        5.367427  426.433027          MBA                 2.574037   \n",
+              "...           ...         ...          ...                      ...   \n",
+              "9994     2.311731  388.723233          MBA                 2.390249   \n",
+              "9995     7.657713  444.194746          MBA                 2.796605   \n",
+              "9996     2.459643  495.460110          MBA                 2.545674   \n",
+              "9997     3.912263  373.848377          MBA                 2.160961   \n",
+              "9998     3.198839  632.144398          MBA                 2.839917   \n",
+              "\n",
+              "     hazardous_flag  \n",
+              "0                 N  \n",
+              "1                 N  \n",
+              "2                 N  \n",
+              "3                 N  \n",
+              "4                 N  \n",
+              "...             ...  \n",
+              "9994              N  \n",
+              "9995              N  \n",
+              "9996              N  \n",
+              "9997              N  \n",
+              "9998              N  \n",
+              "\n",
+              "[9999 rows x 13 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-0161aa69-d50f-4d6f-84c1-10dacb278880\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>full_name</th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>2000001</td>\n",
+              "      <td>1 Ceres</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>939.400</td>\n",
+              "      <td>0.0900</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>2000002</td>\n",
+              "      <td>2 Pallas</td>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>545.000</td>\n",
+              "      <td>0.1010</td>\n",
+              "      <td>18.000</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>2000003</td>\n",
+              "      <td>3 Juno</td>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>246.596</td>\n",
+              "      <td>0.2140</td>\n",
+              "      <td>10.594</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>2000004</td>\n",
+              "      <td>4 Vesta</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>525.400</td>\n",
+              "      <td>0.4228</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>2000005</td>\n",
+              "      <td>5 Astraea</td>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>106.699</td>\n",
+              "      <td>0.2740</td>\n",
+              "      <td>3.140</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>2009995</td>\n",
+              "      <td>9995 Alouette (4805 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.564</td>\n",
+              "      <td>0.2450</td>\n",
+              "      <td>0.550</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>2009996</td>\n",
+              "      <td>9996 ANS (9070 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>8.978</td>\n",
+              "      <td>0.1130</td>\n",
+              "      <td>0.376</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>2009997</td>\n",
+              "      <td>9997 COBE (1217 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>2009998</td>\n",
+              "      <td>9998 ISO (1293 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.235</td>\n",
+              "      <td>0.3880</td>\n",
+              "      <td>0.373</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>2009999</td>\n",
+              "      <td>9999 Wiles (4196 T-2)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>7.148</td>\n",
+              "      <td>0.2620</td>\n",
+              "      <td>0.065</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 13 columns</p>\n",
+              "</div>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 5
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform these columns with [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) to use them during training.\n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
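The two transformations named above can be sketched on a toy frame. This is only an illustration in plain pandas, not the Beam DataFrame API (one-hot encoding is still listed as an open item in this PR); the sample values, the z-score choice of normalization, and the column subset are assumptions made for the example.

```python
import pandas as pd

# Toy data (hypothetical values) mirroring two of the columns shown above.
df = pd.DataFrame({
    "absolute_magnitude": [3.4, 4.2, 5.33, 3.0],
    "object_class": ["MBA", "MBA", "AMO", "APO"],
})

# Numerical column: z-score normalization (zero mean, unit standard deviation).
num = df["absolute_magnitude"]
df["absolute_magnitude"] = (num - num.mean()) / num.std()

# Categorical column: one-hot encoding, one indicator column per category.
encoded = pd.get_dummies(df, columns=["object_class"])
print(encoded.columns.tolist())
```

After encoding, each row has exactly one of the `object_class_*` indicator columns set, and the numerical column is on a scale that does not depend on its original units.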
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check whether all the columns can be used for model training. Let's first look at the column descriptions provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** Absolute magnitude (H). For asteroids, this is the apparent magnitude the object would have if it were 1 au from both the Sun and the observer, at zero phase angle.\n",
+        "* **diameter:** Object diameter (from equivalent sphere), in km\n",
+        "* **albedo:** Fraction of incoming solar radiation that the object reflects diffusely, on a scale from 0 to 1\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in object diameter, in km\n",
+        "* **eccentricity:** Value between 0 and 1 that describes how far the asteroid's orbit deviates from a circle (0 is circular)\n",
+        "* **inclination:** Angle of the orbit with respect to the x-y ecliptic plane\n",
+        "* **moid_ld:** Earth minimum orbit intersection distance, in lunar distances (LD)\n",
+        "* **object_class:** The classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** Length of half of the orbit's long axis, in au\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "The **'spk_id'** and **'full_name'** columns are unique for each row. They can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most columns have no missing values. However, the columns **'diameter'**, **'albedo'** and **'diameter_sigma'** each have more than 13% missing values. Since these values cannot be measured or derived, we can remove these columns; they will not be required for machine learning model training."
+      ]
+    },
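The column-removal step described above can be sketched as follows. This is a minimal illustration in plain pandas on made-up values, and the 10% cutoff is a hypothetical threshold, not one taken from the notebook. With the Beam DataFrame API the percentages are deferred, so they would presumably need to be materialized first (for example with `ib.collect`, as in the cell above) before being used to pick columns.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with gaps in two of the three columns.
df = pd.DataFrame({
    "diameter": [939.4, np.nan, 246.596, np.nan],
    "albedo": [0.09, 0.101, np.nan, 0.4228],
    "eccentricity": [0.076009, 0.229972, 0.256936, 0.088721],
})

# Percentage of missing values per column, same expression as in the notebook.
missing_pct = df.isnull().mean() * 100

# Hypothetical cutoff: drop every column with more than 10% missing values.
threshold = 10.0
to_drop = missing_pct[missing_pct > threshold].index.tolist()
cleaned = df.drop(columns=to_drop)
print(to_drop, list(cleaned.columns))
```

Deriving the list of columns from the percentages, rather than hard-coding it, keeps the step valid if the dataset is refreshed and the set of sparse columns changes.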
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "The numerical columns need to be normalized before they are used to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation, so that each column has zero mean and unit variance. This puts all features on the same scale so that they are weighted comparably during training."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical columns and the categorical columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
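+        "# Columns with a numeric dtype\n",
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "# Remaining (non-numeric) columns, kept in their original order\n",
+        "categorical_cols = [col for col in beam_df.columns if col not in numerical_cols]"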
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 356
+        },
+        "id": "Gjc0UlDD-xUn",
+        "outputId": "cadc4402-7edc-43f6-bc6a-bc4f1fd3314a"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "error",
+          "ename": "NotImplementedError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mNotImplementedError\u001b[0m                       Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/987840581.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Normalizing method_1: Can work but relies on ticket\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frame_base.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    426\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    427\u001b[0m     raise NotImplementedError(\n\u001b[0;32m--> 428\u001b[0;31m         \u001b[0;34mf\"{op!r} is not implemented yet. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    429\u001b[0m         \u001b[0;34mf\"If support for {op!r} is important to you, please let the Beam \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    430\u001b[0m         \u001b[0;34m\"community know by writing to user@beam.apache.org \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mNotImplementedError\u001b[0m: 'loc.setitem' is not implemented yet. If support for 'loc.setitem' is important to you, please let the Beam community know by writing to user@beam.apache.org (see https://beam.apache.org/community/contact-us/) or commenting on https://github.com/apache/beam/issues/20318"
+          ]
+        }
+      ],
+      "source": [
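+        "# Normalizing method_1: could work but relies on ticket #22267; it currently raises NotImplementedError because 'loc.setitem' is not implemented yet\n",
+        "beam_df.loc[:, numerical_cols] = (beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean()) / beam_df.loc[:, numerical_cols].std()"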
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
+        "As a workaround, select the numerical columns into a separate DataFrame and standardize that instead."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "be40308f-d27c-46fa-e365-43a635addf6b"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>306</th>\n",
+              "      <td>-1.570727</td>\n",
+              "      <td>-0.062543</td>\n",
+              "      <td>-0.278518</td>\n",
+              "      <td>0.373194</td>\n",
+              "      <td>0.357201</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>310</th>\n",
+              "      <td>-1.631718</td>\n",
+              "      <td>-1.724526</td>\n",
+              "      <td>-0.736389</td>\n",
+              "      <td>1.087833</td>\n",
+              "      <td>0.344233</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>546</th>\n",
+              "      <td>-1.753698</td>\n",
+              "      <td>1.028793</td>\n",
+              "      <td>1.415303</td>\n",
+              "      <td>-0.339489</td>\n",
+              "      <td>0.139080</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>635</th>\n",
+              "      <td>-1.875678</td>\n",
+              "      <td>0.244869</td>\n",
+              "      <td>0.005905</td>\n",
+              "      <td>0.214107</td>\n",
+              "      <td>0.367559</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>701</th>\n",
+              "      <td>-3.278451</td>\n",
+              "      <td>-1.570523</td>\n",
+              "      <td>2.006145</td>\n",
+              "      <td>1.542754</td>\n",
+              "      <td>0.829337</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9697</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-1.151809</td>\n",
+              "      <td>-0.082944</td>\n",
+              "      <td>-0.129556</td>\n",
+              "      <td>-0.533538</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9813</th>\n",
+              "      <td>1.722740</td>\n",
+              "      <td>0.844551</td>\n",
+              "      <td>-0.583247</td>\n",
+              "      <td>-1.006447</td>\n",
+              "      <td>-0.677961</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9868</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-0.207399</td>\n",
+              "      <td>-0.784665</td>\n",
+              "      <td>-0.462136</td>\n",
+              "      <td>-0.539794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9903</th>\n",
+              "      <td>0.868878</td>\n",
+              "      <td>0.460086</td>\n",
+              "      <td>0.092258</td>\n",
+              "      <td>-0.107597</td>\n",
+              "      <td>0.071794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9956</th>\n",
+              "      <td>0.746898</td>\n",
+              "      <td>-0.234132</td>\n",
+              "      <td>-0.161116</td>\n",
+              "      <td>-0.601379</td>\n",
+              "      <td>-0.664887</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 5 columns</p>\n",
+              "</div>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_numerical_cols = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize only the numerical columns: subtract the mean, divide by the standard deviation\n",
+        "beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean()) / beam_numerical_cols.std()\n",
+        "\n",
+        "ib.collect(beam_numerical_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to convert the categorical columns into one-hot encoded variables so that they can be used during training.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "Ngoxg0rSywVd",
+        "outputId": "d81bb29a-f8f8-4186-b186-5ff85667dbec"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1671644751.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'object_class'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'get_dummies'"
+          ]
+        }
+      ],
+      "source": [
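+        "# Raises AttributeError: DeferredDataFrame has no `get_dummies` method (one-hot encoding is tracked in #22268)\n",
+        "object_class_col = beam_df.filter(items=['object_class'])\n",
+        "object_class_col.get_dummies()"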
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "hz8s7z8caTq-",
+        "outputId": "5c543a6b-0ea1-41f8-afec-9691bbbd1f5b"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1927971370.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;31m# df['categories_concat'].str.get_dummies('-')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'str'"
+          ]
+        }
+      ],
+      "source": [
+        "# Raises AttributeError: `.str` is a Series accessor, but `filter` returns a DataFrame\n",
+        "object_class_col.str.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "rVdSIyCB0spw"
+      },
+      "source": [
+        "# Putting it all together\n",
+        "\n",
+        "Let's now combine all the steps above into a single pipeline implementation and visualize the preprocessed data.\n",
+        "\n",
+        "> ℹ️ Note that the only standard Beam construct used here is the `pipeline` instance. The rest of the preprocessing commands are native pandas methods that have been integrated with the Beam DataFrame API."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "ndaSNond0v8Q",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 498
+        },
+        "outputId": "0155d359-45c9-4345-e1b6-b1881408f049"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "error",
+          "ename": "ValueError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1408061827.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m     25\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     26\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 27\u001b[0;31m \u001b[0mib\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcollect\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpreprocessed_dataset\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/utils.py\u001b[0m in \u001b[0;36mrun_within_progress_indicator\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    275\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mrun_within_progress_indicator\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    276\u001b[0m     \u001b[0;32mwith\u001b[0m \u001b[0mProgressIndicator\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34mf'Processing... {func.__name__}'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Done.'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 277\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001
 b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    278\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    279\u001b[0m   \u001b[0;32mreturn\u001b[0m \u001b[0mrun_within_progress_indicator\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/interactive_beam.py\u001b[0m in \u001b[0;36mcollect\u001b[0;34m(pcoll, n, duration, include_window_info)\u001b[0m\n\u001b[1;32m    945\u001b[0m         element_type=element_type)\n\u001b[1;32m    946\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 947\u001b[0;31m   \u001b[0mrecording\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mrecording_manager\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrecord\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mpcoll\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmax_n\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mn\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmax_duration\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mduration\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    948\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    949\u001b[0m   \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0
 ;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/recording_manager.py\u001b[0m in \u001b[0;36mrecord\u001b[0;34m(self, pcolls, max_n, max_duration)\u001b[0m\n\u001b[1;32m    459\u001b[0m       pf.PipelineFragment(\n\u001b[1;32m    460\u001b[0m           \u001b[0mlist\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0muncomputed_pcolls\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 461\u001b[0;31m           self.user_pipeline.options).run(blocking=is_remote_run)\n\u001b[0m\u001b[1;32m    462\u001b[0m       \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mie\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcurrent_env\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpipeline_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0muser_pipeline\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n
 \u001b[1;32m    463\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py\u001b[0m in \u001b[0;36mrun\u001b[0;34m(self, display_pipeline_graph, use_cache, blocking)\u001b[0m\n\u001b[1;32m    111\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_force_compute\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0muse_cache\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    112\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_blocking\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mblocking\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 113\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdeduce_fragment\u001
 b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    114\u001b[0m     \u001b[0;32mfinally\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    115\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_skip_display\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpreserved_skip_display\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py\u001b[0m in \u001b[0;36mdeduce_fragment\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m     98\u001b[0m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mto_runner_api\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     99\u001b[0m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 100\u001b[0;31m         self._options)\n\u001b[0m\u001b[1;32m    101\u001b[0m     \u001b[0mie\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcurrent_env\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0madd_derived_pipeline\u001b[0m\u001b[0;34m(\u001b[0m\u001b[
 0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfragment\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    102\u001b[0m     \u001b[0;32mreturn\u001b[0m \u001b[0mfragment\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/pipeline.py\u001b[0m in \u001b[0;36mfrom_runner_api\u001b[0;34m(proto, runner, options, return_context)\u001b[0m\n\u001b[1;32m    990\u001b[0m       \u001b[0mpcollection\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpipeline\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mp\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    991\u001b[0m       \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mpcollection\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mproducer\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 992\u001b[0;31m         \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'No producer for %s'\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0mid\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    993\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    994\u001b[0m 
     \u001b[0;31m# Inject PBegin input where necessary.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mValueError\u001b[0m: No producer for ref_PCollection_PCollection_265"
+          ]
+        }
+      ],
+      "source": [
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv('/content/drive/MyDrive/apache beam/dataset/nasa/sample_10000.csv', splittable=True)\n",

Review Comment:
   Can the location be a string variable? That would help with readability.
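
A minimal sketch of what this suggestion could look like, assuming the path from the cell under review. The names `DATASET_DIR`, `SAMPLE_NAME`, and `build_deferred_df` are hypothetical and are not part of the notebook or of Beam; only `read_csv`, `beam.Pipeline`, and `InteractiveRunner` come from the reviewed code.

```python
# Sketch only: DATASET_DIR and SAMPLE_NAME are hypothetical names; the path
# itself is the one hard-coded in the cell under review.
DATASET_DIR = '/content/drive/MyDrive/apache beam/dataset/nasa'
SAMPLE_NAME = 'sample_10000'
file_location = f'{DATASET_DIR}/{SAMPLE_NAME}.csv'


def build_deferred_df(path):
    """Return (pipeline, deferred Beam DataFrame) for the CSV file at `path`."""
    # Imported lazily so the path constants above can be defined and checked
    # without Beam being installed.
    import apache_beam as beam
    from apache_beam.dataframe.io import read_csv
    from apache_beam.runners.interactive.interactive_runner import InteractiveRunner

    p = beam.Pipeline(InteractiveRunner())
    # Same call as in the reviewed cell, with the literal replaced by a variable.
    return p, p | read_csv(path, splittable=True)
```

With the location held in one variable, switching to a different sample size only requires changing `SAMPLE_NAME`.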



##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. It provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by invoking only standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to quickly test our pipeline locally before running it on a distributed runner. \n",

Review Comment:
   Test our code interactively, building out the pipeline as we go



##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. It provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by invoking only standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to quickly test our pipeline locally before running it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we will only be working with a subset of the original dataset since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam function returns a deferred Beam DataFrame, while the pandas function returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, sample] where\n",
+        "# sample contains all of the dataset (around 1000000 samples)\n",
+        "file_location = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(file_location, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. A distance of 70,000 km might not seem potentially harmful, but at an astronomical scale it is very small and can disrupt many natural phenomena. \n",
+        "\n",
+        "These objects and asteroids can therefore prove harmful, so it is wise to know what surrounds us and which of these objects could harm us. This dataset compiles the list of NASA-certified asteroids that are classified as the nearest Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types."
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "c24ff83d-3a13-47a6-c9c2-3978729fde82"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 4
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 746
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "14fa80de-2dee-4963-99d8-3e321f949ff8"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a986c6cc61ed5a5b622e163d92f73775\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "       spk_id                   full_name near_earth_object  \\\n",
+              "0     2000001                     1 Ceres                 N   \n",
+              "1     2000002                    2 Pallas                 N   \n",
+              "2     2000003                      3 Juno                 N   \n",
+              "3     2000004                     4 Vesta                 N   \n",
+              "4     2000005                   5 Astraea                 N   \n",
+              "...       ...                         ...               ...   \n",
+              "9994  2009995    9995 Alouette (4805 P-L)                 N   \n",
+              "9995  2009996         9996 ANS (9070 P-L)                 N   \n",
+              "9996  2009997        9997 COBE (1217 T-1)                 N   \n",
+              "9997  2009998         9998 ISO (1293 T-1)                 N   \n",
+              "9998  2009999       9999 Wiles (4196 T-2)                 N   \n",
+              "\n",
+              "      absolute_magnitude  diameter  albedo  diameter_sigma  eccentricity  \\\n",
+              "0                   3.40   939.400  0.0900           0.200      0.076009   \n",
+              "1                   4.20   545.000  0.1010          18.000      0.229972   \n",
+              "2                   5.33   246.596  0.2140          10.594      0.256936   \n",
+              "3                   3.00   525.400  0.4228           0.200      0.088721   \n",
+              "4                   6.90   106.699  0.2740           3.140      0.190913   \n",
+              "...                  ...       ...     ...             ...           ...   \n",
+              "9994               15.10     2.564  0.2450           0.550      0.160610   \n",
+              "9995               13.60     8.978  0.1130           0.376      0.235174   \n",
+              "9996               14.30       NaN     NaN             NaN      0.113059   \n",
+              "9997               15.10     2.235  0.3880           0.373      0.093852   \n",
+              "9998               13.00     7.148  0.2620           0.065      0.071351   \n",
+              "\n",
+              "      inclination     moid_ld object_class  semi_major_axis_au_unit  \\\n",
+              "0       10.594067  620.640533          MBA                 2.769165   \n",
+              "1       34.832932  480.348639          MBA                 2.773841   \n",
+              "2       12.991043  402.514639          MBA                 2.668285   \n",
+              "3        7.141771  443.451432          MBA                 2.361418   \n",
+              "4        5.367427  426.433027          MBA                 2.574037   \n",
+              "...           ...         ...          ...                      ...   \n",
+              "9994     2.311731  388.723233          MBA                 2.390249   \n",
+              "9995     7.657713  444.194746          MBA                 2.796605   \n",
+              "9996     2.459643  495.460110          MBA                 2.545674   \n",
+              "9997     3.912263  373.848377          MBA                 2.160961   \n",
+              "9998     3.198839  632.144398          MBA                 2.839917   \n",
+              "\n",
+              "     hazardous_flag  \n",
+              "0                 N  \n",
+              "1                 N  \n",
+              "2                 N  \n",
+              "3                 N  \n",
+              "4                 N  \n",
+              "...             ...  \n",
+              "9994              N  \n",
+              "9995              N  \n",
+              "9996              N  \n",
+              "9997              N  \n",
+              "9998              N  \n",
+              "\n",
+              "[9999 rows x 13 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-0161aa69-d50f-4d6f-84c1-10dacb278880\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>full_name</th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>2000001</td>\n",
+              "      <td>1 Ceres</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>939.400</td>\n",
+              "      <td>0.0900</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>2000002</td>\n",
+              "      <td>2 Pallas</td>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>545.000</td>\n",
+              "      <td>0.1010</td>\n",
+              "      <td>18.000</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>2000003</td>\n",
+              "      <td>3 Juno</td>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>246.596</td>\n",
+              "      <td>0.2140</td>\n",
+              "      <td>10.594</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>2000004</td>\n",
+              "      <td>4 Vesta</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>525.400</td>\n",
+              "      <td>0.4228</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>2000005</td>\n",
+              "      <td>5 Astraea</td>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>106.699</td>\n",
+              "      <td>0.2740</td>\n",
+              "      <td>3.140</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>2009995</td>\n",
+              "      <td>9995 Alouette (4805 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.564</td>\n",
+              "      <td>0.2450</td>\n",
+              "      <td>0.550</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>2009996</td>\n",
+              "      <td>9996 ANS (9070 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>8.978</td>\n",
+              "      <td>0.1130</td>\n",
+              "      <td>0.376</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>2009997</td>\n",
+              "      <td>9997 COBE (1217 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>2009998</td>\n",
+              "      <td>9998 ISO (1293 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.235</td>\n",
+              "      <td>0.3880</td>\n",
+              "      <td>0.373</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>2009999</td>\n",
+              "      <td>9999 Wiles (4196 T-2)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>7.148</td>\n",
+              "      <td>0.2620</td>\n",
+              "      <td>0.065</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 13 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-0161aa69-d50f-4d6f-84c1-10dacb278880')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 5
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check if all the columns can be used for model training. Let's first have a look at the column description as provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",

Review Comment:
   can be used, or need to be used?



##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. It provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by only invoking standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to quickly test our pipeline locally before running it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we will only be working with a subset of the original dataset since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam function returns a deferred Beam DataFrame, while pandas returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, sample] where\n",
+        "# sample contains all of the dataset (around 1000000 samples)\n",
+        "file_location = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize the pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(file_location, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. Although a distance of 70,000 km might not seem potentially harmful, at an astronomical scale it is very small and can disrupt many natural phenomena.\n",
+        "\n",
+        "These objects, or asteroids, can thus prove to be harmful, so it is wise to know what surrounds us and which of those objects can harm us. This dataset compiles the list of NASA-certified asteroids that are classified as the nearest Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types:"
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "c24ff83d-3a13-47a6-c9c2-3978729fde82"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 4
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a Pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 746
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "14fa80de-2dee-4963-99d8-3e321f949ff8"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a986c6cc61ed5a5b622e163d92f73775\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "       spk_id                   full_name near_earth_object  \\\n",
+              "0     2000001                     1 Ceres                 N   \n",
+              "1     2000002                    2 Pallas                 N   \n",
+              "2     2000003                      3 Juno                 N   \n",
+              "3     2000004                     4 Vesta                 N   \n",
+              "4     2000005                   5 Astraea                 N   \n",
+              "...       ...                         ...               ...   \n",
+              "9994  2009995    9995 Alouette (4805 P-L)                 N   \n",
+              "9995  2009996         9996 ANS (9070 P-L)                 N   \n",
+              "9996  2009997        9997 COBE (1217 T-1)                 N   \n",
+              "9997  2009998         9998 ISO (1293 T-1)                 N   \n",
+              "9998  2009999       9999 Wiles (4196 T-2)                 N   \n",
+              "\n",
+              "      absolute_magnitude  diameter  albedo  diameter_sigma  eccentricity  \\\n",
+              "0                   3.40   939.400  0.0900           0.200      0.076009   \n",
+              "1                   4.20   545.000  0.1010          18.000      0.229972   \n",
+              "2                   5.33   246.596  0.2140          10.594      0.256936   \n",
+              "3                   3.00   525.400  0.4228           0.200      0.088721   \n",
+              "4                   6.90   106.699  0.2740           3.140      0.190913   \n",
+              "...                  ...       ...     ...             ...           ...   \n",
+              "9994               15.10     2.564  0.2450           0.550      0.160610   \n",
+              "9995               13.60     8.978  0.1130           0.376      0.235174   \n",
+              "9996               14.30       NaN     NaN             NaN      0.113059   \n",
+              "9997               15.10     2.235  0.3880           0.373      0.093852   \n",
+              "9998               13.00     7.148  0.2620           0.065      0.071351   \n",
+              "\n",
+              "      inclination     moid_ld object_class  semi_major_axis_au_unit  \\\n",
+              "0       10.594067  620.640533          MBA                 2.769165   \n",
+              "1       34.832932  480.348639          MBA                 2.773841   \n",
+              "2       12.991043  402.514639          MBA                 2.668285   \n",
+              "3        7.141771  443.451432          MBA                 2.361418   \n",
+              "4        5.367427  426.433027          MBA                 2.574037   \n",
+              "...           ...         ...          ...                      ...   \n",
+              "9994     2.311731  388.723233          MBA                 2.390249   \n",
+              "9995     7.657713  444.194746          MBA                 2.796605   \n",
+              "9996     2.459643  495.460110          MBA                 2.545674   \n",
+              "9997     3.912263  373.848377          MBA                 2.160961   \n",
+              "9998     3.198839  632.144398          MBA                 2.839917   \n",
+              "\n",
+              "     hazardous_flag  \n",
+              "0                 N  \n",
+              "1                 N  \n",
+              "2                 N  \n",
+              "3                 N  \n",
+              "4                 N  \n",
+              "...             ...  \n",
+              "9994              N  \n",
+              "9995              N  \n",
+              "9996              N  \n",
+              "9997              N  \n",
+              "9998              N  \n",
+              "\n",
+              "[9999 rows x 13 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-0161aa69-d50f-4d6f-84c1-10dacb278880\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>full_name</th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>2000001</td>\n",
+              "      <td>1 Ceres</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>939.400</td>\n",
+              "      <td>0.0900</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>2000002</td>\n",
+              "      <td>2 Pallas</td>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>545.000</td>\n",
+              "      <td>0.1010</td>\n",
+              "      <td>18.000</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>2000003</td>\n",
+              "      <td>3 Juno</td>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>246.596</td>\n",
+              "      <td>0.2140</td>\n",
+              "      <td>10.594</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>2000004</td>\n",
+              "      <td>4 Vesta</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>525.400</td>\n",
+              "      <td>0.4228</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>2000005</td>\n",
+              "      <td>5 Astraea</td>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>106.699</td>\n",
+              "      <td>0.2740</td>\n",
+              "      <td>3.140</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>2009995</td>\n",
+              "      <td>9995 Alouette (4805 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.564</td>\n",
+              "      <td>0.2450</td>\n",
+              "      <td>0.550</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>2009996</td>\n",
+              "      <td>9996 ANS (9070 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>8.978</td>\n",
+              "      <td>0.1130</td>\n",
+              "      <td>0.376</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>2009997</td>\n",
+              "      <td>9997 COBE (1217 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>2009998</td>\n",
+              "      <td>9998 ISO (1293 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.235</td>\n",
+              "      <td>0.3880</td>\n",
+              "      <td>0.373</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>2009999</td>\n",
+              "      <td>9999 Wiles (4196 T-2)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>7.148</td>\n",
+              "      <td>0.2620</td>\n",
+              "      <td>0.065</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 13 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-0161aa69-d50f-4d6f-84c1-10dacb278880')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 5
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** These columns need to be transformed with [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) before they can be used for training.\n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check whether all the columns can be used for model training. Let's first have a look at the column descriptions as provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the apparent magnitude the object would have if it were located 1 au from both the Sun and the observer, at zero phase angle.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km.\n",
+        "* **albedo:** the fraction of incoming solar radiation that the object diffusely reflects, measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in object diameter, in km.\n",
+        "* **eccentricity:** value between 0 and 1 that refers to how flat or round the asteroid's orbit is (0 is a perfect circle).\n",
+        "* **inclination:** angle of the orbit with respect to the x-y ecliptic plane.\n",
+        "* **moid_ld:** Earth Minimum Orbit Intersection Distance, in lunar distance (LD) units.\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** the length of half of the long axis of the orbit, in au.\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. These columns can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's have a look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_02cba1067de8e024374a26297834d233\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most columns have no missing values. Columns **'diameter'**, **'albedo'** and **'diameter_sigma'** are each missing in roughly 13-14% of rows. Since these values cannot be measured or derived, we can remove these columns; they will not be required for machine learning model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "The numerical columns need to be normalized before using them to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation. This puts all of the columns on the same scale so that they are weighted equally during training.  "

Review Comment:
   Maybe a link ? 
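
For context, the standardization described in the quoted cell can be sketched in plain pandas as below. This is an illustrative sketch only, not code from the PR: the helper name `standardize` is hypothetical, the sample values are copied from the first three rows of the notebook output, and it uses pandas directly rather than the Beam DataFrame API.

```python
import pandas as pd


def standardize(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df whose numeric columns have mean 0 and std 1.

    Hypothetical helper for illustration; non-numeric columns are left as-is.
    """
    numeric = df.select_dtypes(include="number")
    # Subtract the column mean and divide by the column standard deviation.
    # pandas computes the sample standard deviation (ddof=1) by default.
    scaled = (numeric - numeric.mean()) / numeric.std()
    out = df.copy()
    out[numeric.columns] = scaled
    return out


# Values taken from the first three rows shown in the notebook output.
sample = pd.DataFrame({
    "eccentricity": [0.076009, 0.229972, 0.256936],
    "inclination": [10.594067, 34.832932, 12.991043],
    "object_class": ["MBA", "MBA", "MBA"],
})

result = standardize(sample)
print(result)
```

After scaling, each numeric column has a mean of approximately 0 and a sample standard deviation of approximately 1, while `object_class` is untouched.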





[GitHub] [beam] TheNeuralBit commented on pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on PR #22587:
URL: https://github.com/apache/beam/pull/22587#issuecomment-1209996717

   >  Have only one installation script for Beam with the latest implemented functions in the Dataframe API instead of installing from source
   
   This is just blocked on the 2.41.0 release, right?




[GitHub] [beam] PhilippeMoussalli commented on pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
PhilippeMoussalli commented on PR #22587:
URL: https://github.com/apache/beam/pull/22587#issuecomment-1239181651

   > > I think the "No producer" error is a bug with PCollection pruning in interactive beam: #21430
   > > CC: @KevinGG
   > 
   > Commented in #21430, we can disable pruning for dataframe like what we did for TestStream: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py#L219
   
   Is there any update on this or a potential workaround for merging Deferred dataframes? 




[GitHub] [beam] TheNeuralBit commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r1003832395


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. They give machine learning practitioners a familiar interface for building complex data-processing pipelines using only standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to quickly test our pipeline locally before running it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we will only be working with a subset of the original dataset since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam method returns a deferred Beam DataFrame, whereas pandas returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, sample] where\n",
+        "# sample contains all of the dataset (around 1000000 samples)\n",
+        "file_location = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize the pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(file_location, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. A distance of 70,000 km might seem too large to pose any danger, but at an astronomical scale it is very small, and an object passing that close can disrupt many natural phenomena. \n",
+        "\n",
+        "These objects, or asteroids, can therefore prove harmful, so it is wise to know what surrounds us and which of those objects could harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types"
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "c24ff83d-3a13-47a6-c9c2-3978729fde82"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 4
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a Pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 746
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "14fa80de-2dee-4963-99d8-3e321f949ff8"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a986c6cc61ed5a5b622e163d92f73775\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "       spk_id                   full_name near_earth_object  \\\n",
+              "0     2000001                     1 Ceres                 N   \n",
+              "1     2000002                    2 Pallas                 N   \n",
+              "2     2000003                      3 Juno                 N   \n",
+              "3     2000004                     4 Vesta                 N   \n",
+              "4     2000005                   5 Astraea                 N   \n",
+              "...       ...                         ...               ...   \n",
+              "9994  2009995    9995 Alouette (4805 P-L)                 N   \n",
+              "9995  2009996         9996 ANS (9070 P-L)                 N   \n",
+              "9996  2009997        9997 COBE (1217 T-1)                 N   \n",
+              "9997  2009998         9998 ISO (1293 T-1)                 N   \n",
+              "9998  2009999       9999 Wiles (4196 T-2)                 N   \n",
+              "\n",
+              "      absolute_magnitude  diameter  albedo  diameter_sigma  eccentricity  \\\n",
+              "0                   3.40   939.400  0.0900           0.200      0.076009   \n",
+              "1                   4.20   545.000  0.1010          18.000      0.229972   \n",
+              "2                   5.33   246.596  0.2140          10.594      0.256936   \n",
+              "3                   3.00   525.400  0.4228           0.200      0.088721   \n",
+              "4                   6.90   106.699  0.2740           3.140      0.190913   \n",
+              "...                  ...       ...     ...             ...           ...   \n",
+              "9994               15.10     2.564  0.2450           0.550      0.160610   \n",
+              "9995               13.60     8.978  0.1130           0.376      0.235174   \n",
+              "9996               14.30       NaN     NaN             NaN      0.113059   \n",
+              "9997               15.10     2.235  0.3880           0.373      0.093852   \n",
+              "9998               13.00     7.148  0.2620           0.065      0.071351   \n",
+              "\n",
+              "      inclination     moid_ld object_class  semi_major_axis_au_unit  \\\n",
+              "0       10.594067  620.640533          MBA                 2.769165   \n",
+              "1       34.832932  480.348639          MBA                 2.773841   \n",
+              "2       12.991043  402.514639          MBA                 2.668285   \n",
+              "3        7.141771  443.451432          MBA                 2.361418   \n",
+              "4        5.367427  426.433027          MBA                 2.574037   \n",
+              "...           ...         ...          ...                      ...   \n",
+              "9994     2.311731  388.723233          MBA                 2.390249   \n",
+              "9995     7.657713  444.194746          MBA                 2.796605   \n",
+              "9996     2.459643  495.460110          MBA                 2.545674   \n",
+              "9997     3.912263  373.848377          MBA                 2.160961   \n",
+              "9998     3.198839  632.144398          MBA                 2.839917   \n",
+              "\n",
+              "     hazardous_flag  \n",
+              "0                 N  \n",
+              "1                 N  \n",
+              "2                 N  \n",
+              "3                 N  \n",
+              "4                 N  \n",
+              "...             ...  \n",
+              "9994              N  \n",
+              "9995              N  \n",
+              "9996              N  \n",
+              "9997              N  \n",
+              "9998              N  \n",
+              "\n",
+              "[9999 rows x 13 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-0161aa69-d50f-4d6f-84c1-10dacb278880\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>full_name</th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>2000001</td>\n",
+              "      <td>1 Ceres</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>939.400</td>\n",
+              "      <td>0.0900</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>2000002</td>\n",
+              "      <td>2 Pallas</td>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>545.000</td>\n",
+              "      <td>0.1010</td>\n",
+              "      <td>18.000</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>2000003</td>\n",
+              "      <td>3 Juno</td>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>246.596</td>\n",
+              "      <td>0.2140</td>\n",
+              "      <td>10.594</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>2000004</td>\n",
+              "      <td>4 Vesta</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>525.400</td>\n",
+              "      <td>0.4228</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>2000005</td>\n",
+              "      <td>5 Astraea</td>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>106.699</td>\n",
+              "      <td>0.2740</td>\n",
+              "      <td>3.140</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>2009995</td>\n",
+              "      <td>9995 Alouette (4805 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.564</td>\n",
+              "      <td>0.2450</td>\n",
+              "      <td>0.550</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>2009996</td>\n",
+              "      <td>9996 ANS (9070 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>8.978</td>\n",
+              "      <td>0.1130</td>\n",
+              "      <td>0.376</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>2009997</td>\n",
+              "      <td>9997 COBE (1217 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>2009998</td>\n",
+              "      <td>9998 ISO (1293 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.235</td>\n",
+              "      <td>0.3880</td>\n",
+              "      <td>0.373</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>2009999</td>\n",
+              "      <td>9999 Wiles (4196 T-2)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>7.148</td>\n",
+              "      <td>0.2620</td>\n",
+              "      <td>0.065</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 13 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-0161aa69-d50f-4d6f-84c1-10dacb278880')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 5
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
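As an illustration of the one-hot encoding mentioned above, here is a minimal plain-Python sketch. It is not the Beam DataFrame API and not the notebook's own method; the helper name `one_hot` and the sample class labels are assumptions chosen for the example.

```python
def one_hot(values):
    """Return (categories, rows): the sorted distinct categories and one 0/1 row per value."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one position is set per input value
        rows.append(row)
    return categories, rows


# Hypothetical object_class values, used only to show the shape of the result.
categories, encoded = one_hot(["MBA", "APO", "MBA", "AMO"])
```

Each distinct category becomes its own 0/1 column, so a column with k classes expands into k columns.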
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check if all the columns can be used for model training. Let's first have a look at the column description as provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the apparent magnitude an object would have if it were located at a distance of 10 parsecs.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km\n",
+        "* **albedo:** a measure of the diffuse reflection of solar radiation out of the total solar radiation and measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in object diameter, in km\n",
+        "* **eccentricity:** value between 0 and 1 that describes how elongated (rather than circular) the asteroid's orbit is\n",
+        "* **inclination:** angle with respect to x-y ecliptic plane\n",
+        "* **moid_ld:** Earth Minimum Orbit Intersection Distance, in lunar distances (LD)\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** half the length of the long axis of the orbit, in au\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. These columns can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's have a look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_02cba1067de8e024374a26297834d233\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
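The expression `beam_df.isnull().mean() * 100` in the cell above computes, for each column, the share of missing entries as a percentage. A minimal plain-Python sketch of the same arithmetic for a single column follows; the helper name `percent_missing` and the sample values are assumptions for illustration, not part of Beam or pandas.

```python
import math


def percent_missing(column):
    """Percentage of entries that are None or NaN."""
    missing = sum(
        1 for value in column
        if value is None or (isinstance(value, float) and math.isnan(value))
    )
    return 100.0 * missing / len(column)


# Hypothetical sample column: two of the four entries are missing.
sample = [0.26, None, 0.07, float("nan")]
pct = percent_missing(sample)
```

By the same arithmetic, the 13.111311 reported for `diameter` over 9999 rows would correspond to about 1311 missing entries.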
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most columns have no missing values. Columns **'diameter'**, **'albedo'** and **'diameter_sigma'** are each missing in roughly 13-14% of the rows. Since these values cannot be measured or derived, we can remove these columns; they will not be required for machine learning model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "The numerical columns need to be normalized before using them to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation. This ensures that all the columns have the same scale and are weighted equally during training."
+      ]
+    },
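The standardization described above (subtract the mean, divide by the standard deviation) can be sketched in plain Python for a single column. This is an illustration of the formula only, not the Beam DataFrame code; the helper name `standardize` is an assumption, and the sample standard deviation (n - 1 denominator) is used because that is also the default of pandas' `.std()`.

```python
import statistics


def standardize(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(value - mean) / std for value in values]


# First five absolute_magnitude values shown in the table output above.
scaled = standardize([3.40, 4.20, 5.33, 3.00, 6.90])
```

After this transformation the column has mean 0 and sample standard deviation 1, which is what puts columns with very different ranges (for example `eccentricity` and `moid_ld`) on a comparable scale.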
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical columns and the categorical columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))"
+      ]
+    },
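One detail of the cell above is that `list(set(...) - set(...))` does not preserve the original column order. If order matters downstream, a list comprehension keeps it. The sketch below shows the idea in plain Python on a dict of columns; the helper name `split_columns` and the tiny table are assumptions for illustration, not the Beam DataFrame API.

```python
from numbers import Number


def split_columns(table):
    """Split a dict of column name -> values into numerical and categorical names, in order."""
    numerical = [
        name for name, values in table.items()
        if all(isinstance(v, Number) and not isinstance(v, bool) for v in values)
    ]
    categorical = [name for name in table if name not in numerical]
    return numerical, categorical


# Hypothetical two-row excerpt using column names from the notebook.
table = {
    "near_earth_object": ["N", "N"],
    "absolute_magnitude": [3.40, 4.20],
    "object_class": ["MBA", "MBA"],
    "eccentricity": [0.076009, 0.229972],
}
numerical_cols, categorical_cols = split_columns(table)
```

With a DataFrame, the equivalent order-preserving form would be a comprehension over `beam_df.columns` that skips names found in `numerical_cols`.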
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 356
+        },
+        "id": "Gjc0UlDD-xUn",
+        "outputId": "cadc4402-7edc-43f6-bc6a-bc4f1fd3314a"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "error",
+          "ename": "NotImplementedError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mNotImplementedError\u001b[0m                       Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/987840581.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Normalizing method_1: Can work but relies on ticket\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frame_base.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    426\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    427\u001b[0m     raise NotImplementedError(\n\u001b[0;32m--> 428\u001b[0;31m         \u001b[0;34mf\"{op!r} is not implemented yet. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    429\u001b[0m         \u001b[0;34mf\"If support for {op!r} is important to you, please let the Beam \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    430\u001b[0m         \u001b[0;34m\"community know by writing to user@beam.apache.org \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mNotImplementedError\u001b[0m: 'loc.setitem' is not implemented yet. If support for 'loc.setitem' is important to you, please let the Beam community know by writing to user@beam.apache.org (see https://beam.apache.org/community/contact-us/) or commenting on https://github.com/apache/beam/issues/20318"
+          ]
+        }
+      ],
+      "source": [
+        "# Normalizing method_1: Can work but relies on ticket #22267\n",
+        "beam_df.loc[:,numerical_cols] = (beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean())/beam_df.loc[:, numerical_cols].std()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
+        "Normalizing the data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "be40308f-d27c-46fa-e365-43a635addf6b"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>306</th>\n",
+              "      <td>-1.570727</td>\n",
+              "      <td>-0.062543</td>\n",
+              "      <td>-0.278518</td>\n",
+              "      <td>0.373194</td>\n",
+              "      <td>0.357201</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>310</th>\n",
+              "      <td>-1.631718</td>\n",
+              "      <td>-1.724526</td>\n",
+              "      <td>-0.736389</td>\n",
+              "      <td>1.087833</td>\n",
+              "      <td>0.344233</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>546</th>\n",
+              "      <td>-1.753698</td>\n",
+              "      <td>1.028793</td>\n",
+              "      <td>1.415303</td>\n",
+              "      <td>-0.339489</td>\n",
+              "      <td>0.139080</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>635</th>\n",
+              "      <td>-1.875678</td>\n",
+              "      <td>0.244869</td>\n",
+              "      <td>0.005905</td>\n",
+              "      <td>0.214107</td>\n",
+              "      <td>0.367559</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>701</th>\n",
+              "      <td>-3.278451</td>\n",
+              "      <td>-1.570523</td>\n",
+              "      <td>2.006145</td>\n",
+              "      <td>1.542754</td>\n",
+              "      <td>0.829337</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9697</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-1.151809</td>\n",
+              "      <td>-0.082944</td>\n",
+              "      <td>-0.129556</td>\n",
+              "      <td>-0.533538</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9813</th>\n",
+              "      <td>1.722740</td>\n",
+              "      <td>0.844551</td>\n",
+              "      <td>-0.583247</td>\n",
+              "      <td>-1.006447</td>\n",
+              "      <td>-0.677961</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9868</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-0.207399</td>\n",
+              "      <td>-0.784665</td>\n",
+              "      <td>-0.462136</td>\n",
+              "      <td>-0.539794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9903</th>\n",
+              "      <td>0.868878</td>\n",
+              "      <td>0.460086</td>\n",
+              "      <td>0.092258</td>\n",
+              "      <td>-0.107597</td>\n",
+              "      <td>0.071794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9956</th>\n",
+              "      <td>0.746898</td>\n",
+              "      <td>-0.234132</td>\n",
+              "      <td>-0.161116</td>\n",
+              "      <td>-0.601379</td>\n",
+              "      <td>-0.664887</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 5 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_numerical_cols = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize only the numerical columns\n",
+        "beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean())/beam_numerical_cols.std()\n",
+        "\n",
+        "ib.collect(beam_numerical_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to convert the categorical columns into one-hot encoded variables so that they can be used during training.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "Ngoxg0rSywVd",
+        "outputId": "d81bb29a-f8f8-4186-b186-5ff85667dbec"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1671644751.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'object_class'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'get_dummies'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col= beam_df.filter(items=['object_class'])\n",
+        "object_class_col.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "hz8s7z8caTq-",
+        "outputId": "5c543a6b-0ea1-41f8-afec-9691bbbd1f5b"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1927971370.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;31m# df['categories_concat'].str.get_dummies('-')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'str'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col.str.get_dummies()"

Review Comment:
   The issue with astype should be resolved now and the fix will be in Beam 2.43.0. Can you try this again?
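
For context, assuming the `astype` call in question is the cast to a categorical dtype that precedes `get_dummies`: a deferred DataFrame has to know its output columns when the pipeline is built, so one-hot encoding needs the set of categories declared up front. The sketch below is plain Python rather than Beam code; `one_hot` and the category labels are hypothetical names, used only to illustrate the shape of the result the notebook cell is expected to produce.

```python
from typing import Dict, List, Sequence


def one_hot(values: Sequence[str], categories: Sequence[str]) -> List[Dict[str, int]]:
    """Return one indicator row per value, with one column per declared category."""
    unknown = set(values) - set(categories)
    if unknown:
        # Mirrors the constraint that every value must belong to a category
        # declared before the data is seen.
        raise ValueError(f"values outside the declared categories: {sorted(unknown)}")
    return [{cat: int(value == cat) for cat in categories} for value in values]


# Hypothetical category labels, for illustration only.
categories = ["AMO", "APO", "ATE"]
rows = one_hot(["APO", "AMO", "APO"], categories)
print(rows[0])
```

Because the columns come from `categories` and not from the data, the output schema is fixed before any row is processed, which is the property a deferred pipeline needs.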





[GitHub] [beam] PhilippeMoussalli commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
PhilippeMoussalli commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r1005906158


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,2163 @@
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "source": [
+        "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License."
+      ],
+      "metadata": {
+        "id": "sARMhsXz8yR1",
+        "cellView": "form"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use: it offers intuitive methods for common analytical tasks and data pre-processing.\n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well for small-scale datasets. However, many projects involve datasets that grow too big to fit in memory. These use cases generally require parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like\n",
+        "API to declare and define Beam processing pipelines. They give machine learning practitioners a familiar interface for building complex data-processing pipelines using only standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "Because we want to explore the elements within a `PCollection`, we can make use of the Interactive runner by installing Apache Beam with the `interactive` component. The latest DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to test our code interactively, building out the pipeline as we go before deploying it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we will only be working with a subset of the original dataset since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam method returns a deferred Beam DataFrame, while pandas returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 5,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam import dataframe\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, full] where\n",
+        "# full contains all of the dataset (around 1000000 samples)\n",
+        "\n",
+        "source_csv_file = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(source_csv_file, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. We might assume that a distance of 70,000 km cannot harm us, but on an astronomical scale this is a very small distance and can disrupt many natural phenomena.\n",
+        "\n",
+        "These objects and asteroids can therefore prove harmful, so it is wise to know what surrounds us and which of those objects can harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest-Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types."
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 3,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "f5386993-14cb-42ee-94ca-8ea006860d3e"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 3
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a Pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 8,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 378
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "e49b4243-107f-4256-9e09-49cc20bf7f56"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_f6af66571c53daa0d9052370b7d1d8b7\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_f6af66571c53daa0d9052370b7d1d8b7\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_f6af66571c53daa0d9052370b7d1d8b7\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "             spk_id  absolute_magnitude     diameter       albedo  \\\n",
+              "count  9.999000e+03         9999.000000  8688.000000  8672.000000   \n",
+              "mean   2.005000e+06           12.675380    19.245446     0.197723   \n",
+              "std    2.886607e+03            1.639609    30.190191     0.138819   \n",
+              "min    2.000001e+06            3.000000     0.300000     0.008000   \n",
+              "25%    2.002500e+06           11.900000     5.614000     0.074000   \n",
+              "50%    2.005000e+06           12.900000     9.814000     0.187000   \n",
+              "75%    2.007500e+06           13.700000    19.156750     0.283000   \n",
+              "max    2.009999e+06           20.700000   939.400000     1.000000   \n",
+              "\n",
+              "       diameter_sigma  eccentricity  inclination      moid_ld  \\\n",
+              "count     8591.000000   9999.000000  9999.000000  9999.000000   \n",
+              "mean         0.454072      0.148716     7.890742   509.805237   \n",
+              "std          1.093676      0.083803     6.336244   205.046582   \n",
+              "min          0.006000      0.001003     0.042716     0.131028   \n",
+              "25%          0.120000      0.093780     3.220137   377.829197   \n",
+              "50%          0.201000      0.140335     6.018836   470.650523   \n",
+              "75%          0.375000      0.187092    10.918176   636.010802   \n",
+              "max         39.297000      0.889831    68.018875  4241.524913   \n",
+              "\n",
+              "       semi_major_axis_au_unit  \n",
+              "count              9999.000000  \n",
+              "mean                  2.689836  \n",
+              "std                   0.607190  \n",
+              "min                   0.832048  \n",
+              "25%                   2.340816  \n",
+              "50%                   2.614468  \n",
+              "75%                   3.005449  \n",
+              "max                  24.667968  "
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-2c0349a9-81c4-473a-9fa1-44c423244858\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>count</th>\n",
+              "      <td>9.999000e+03</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>8688.000000</td>\n",
+              "      <td>8672.000000</td>\n",
+              "      <td>8591.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>mean</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.675380</td>\n",
+              "      <td>19.245446</td>\n",
+              "      <td>0.197723</td>\n",
+              "      <td>0.454072</td>\n",
+              "      <td>0.148716</td>\n",
+              "      <td>7.890742</td>\n",
+              "      <td>509.805237</td>\n",
+              "      <td>2.689836</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>std</th>\n",
+              "      <td>2.886607e+03</td>\n",
+              "      <td>1.639609</td>\n",
+              "      <td>30.190191</td>\n",
+              "      <td>0.138819</td>\n",
+              "      <td>1.093676</td>\n",
+              "      <td>0.083803</td>\n",
+              "      <td>6.336244</td>\n",
+              "      <td>205.046582</td>\n",
+              "      <td>0.607190</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>min</th>\n",
+              "      <td>2.000001e+06</td>\n",
+              "      <td>3.000000</td>\n",
+              "      <td>0.300000</td>\n",
+              "      <td>0.008000</td>\n",
+              "      <td>0.006000</td>\n",
+              "      <td>0.001003</td>\n",
+              "      <td>0.042716</td>\n",
+              "      <td>0.131028</td>\n",
+              "      <td>0.832048</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>25%</th>\n",
+              "      <td>2.002500e+06</td>\n",
+              "      <td>11.900000</td>\n",
+              "      <td>5.614000</td>\n",
+              "      <td>0.074000</td>\n",
+              "      <td>0.120000</td>\n",
+              "      <td>0.093780</td>\n",
+              "      <td>3.220137</td>\n",
+              "      <td>377.829197</td>\n",
+              "      <td>2.340816</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>50%</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.900000</td>\n",
+              "      <td>9.814000</td>\n",
+              "      <td>0.187000</td>\n",
+              "      <td>0.201000</td>\n",
+              "      <td>0.140335</td>\n",
+              "      <td>6.018836</td>\n",
+              "      <td>470.650523</td>\n",
+              "      <td>2.614468</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>75%</th>\n",
+              "      <td>2.007500e+06</td>\n",
+              "      <td>13.700000</td>\n",
+              "      <td>19.156750</td>\n",
+              "      <td>0.283000</td>\n",
+              "      <td>0.375000</td>\n",
+              "      <td>0.187092</td>\n",
+              "      <td>10.918176</td>\n",
+              "      <td>636.010802</td>\n",
+              "      <td>3.005449</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>max</th>\n",
+              "      <td>2.009999e+06</td>\n",
+              "      <td>20.700000</td>\n",
+              "      <td>939.400000</td>\n",
+              "      <td>1.000000</td>\n",
+              "      <td>39.297000</td>\n",
+              "      <td>0.889831</td>\n",
+              "      <td>68.018875</td>\n",
+              "      <td>4241.524913</td>\n",
+              "      <td>24.667968</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-2c0349a9-81c4-473a-9fa1-44c423244858')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-2c0349a9-81c4-473a-9fa1-44c423244858 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-2c0349a9-81c4-473a-9fa1-44c423244858');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://developers.google.com/machine-learning/data-prep/transform/normalization) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://developers.google.com/machine-learning/data-prep/transform/transform-categorical) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can also use the standard pandas command `DataFrame.describe()` to generate descriptive statistics for the numerical columns, such as percentiles, mean, and standard deviation."
+      ],
+      "metadata": {
+        "id": "MGAErO0lAYws"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "with dataframe.allow_non_parallel_operations():\n",
+        "  beam_df_description = ib.collect(beam_df.describe())\n",
+        "\n",
+        "beam_df_description"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 378
+        },
+        "id": "Befv697VBGM7",
+        "outputId": "d02b7a41-a8a3-4837-cf63-e1fa9e7b011e"
+      },
+      "execution_count": 9,
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a5b31481d153dff1b7ecdd673624949b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a5b31481d153dff1b7ecdd673624949b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a5b31481d153dff1b7ecdd673624949b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "             spk_id  absolute_magnitude     diameter       albedo  \\\n",
+              "count  9.999000e+03         9999.000000  8688.000000  8672.000000   \n",
+              "mean   2.005000e+06           12.675380    19.245446     0.197723   \n",
+              "std    2.886607e+03            1.639609    30.190191     0.138819   \n",
+              "min    2.000001e+06            3.000000     0.300000     0.008000   \n",
+              "25%    2.002500e+06           11.900000     5.614000     0.074000   \n",
+              "50%    2.005000e+06           12.900000     9.814000     0.187000   \n",
+              "75%    2.007500e+06           13.700000    19.156750     0.283000   \n",
+              "max    2.009999e+06           20.700000   939.400000     1.000000   \n",
+              "\n",
+              "       diameter_sigma  eccentricity  inclination      moid_ld  \\\n",
+              "count     8591.000000   9999.000000  9999.000000  9999.000000   \n",
+              "mean         0.454072      0.148716     7.890742   509.805237   \n",
+              "std          1.093676      0.083803     6.336244   205.046582   \n",
+              "min          0.006000      0.001003     0.042716     0.131028   \n",
+              "25%          0.120000      0.093780     3.220137   377.829197   \n",
+              "50%          0.201000      0.140335     6.018836   470.650523   \n",
+              "75%          0.375000      0.187092    10.918176   636.010802   \n",
+              "max         39.297000      0.889831    68.018875  4241.524913   \n",
+              "\n",
+              "       semi_major_axis_au_unit  \n",
+              "count              9999.000000  \n",
+              "mean                  2.689836  \n",
+              "std                   0.607190  \n",
+              "min                   0.832048  \n",
+              "25%                   2.340816  \n",
+              "50%                   2.614468  \n",
+              "75%                   3.005449  \n",
+              "max                  24.667968  "
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-d16cf806-a3e2-46d9-973d-74448570aaa2\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>count</th>\n",
+              "      <td>9.999000e+03</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>8688.000000</td>\n",
+              "      <td>8672.000000</td>\n",
+              "      <td>8591.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>mean</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.675380</td>\n",
+              "      <td>19.245446</td>\n",
+              "      <td>0.197723</td>\n",
+              "      <td>0.454072</td>\n",
+              "      <td>0.148716</td>\n",
+              "      <td>7.890742</td>\n",
+              "      <td>509.805237</td>\n",
+              "      <td>2.689836</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>std</th>\n",
+              "      <td>2.886607e+03</td>\n",
+              "      <td>1.639609</td>\n",
+              "      <td>30.190191</td>\n",
+              "      <td>0.138819</td>\n",
+              "      <td>1.093676</td>\n",
+              "      <td>0.083803</td>\n",
+              "      <td>6.336244</td>\n",
+              "      <td>205.046582</td>\n",
+              "      <td>0.607190</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>min</th>\n",
+              "      <td>2.000001e+06</td>\n",
+              "      <td>3.000000</td>\n",
+              "      <td>0.300000</td>\n",
+              "      <td>0.008000</td>\n",
+              "      <td>0.006000</td>\n",
+              "      <td>0.001003</td>\n",
+              "      <td>0.042716</td>\n",
+              "      <td>0.131028</td>\n",
+              "      <td>0.832048</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>25%</th>\n",
+              "      <td>2.002500e+06</td>\n",
+              "      <td>11.900000</td>\n",
+              "      <td>5.614000</td>\n",
+              "      <td>0.074000</td>\n",
+              "      <td>0.120000</td>\n",
+              "      <td>0.093780</td>\n",
+              "      <td>3.220137</td>\n",
+              "      <td>377.829197</td>\n",
+              "      <td>2.340816</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>50%</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.900000</td>\n",
+              "      <td>9.814000</td>\n",
+              "      <td>0.187000</td>\n",
+              "      <td>0.201000</td>\n",
+              "      <td>0.140335</td>\n",
+              "      <td>6.018836</td>\n",
+              "      <td>470.650523</td>\n",
+              "      <td>2.614468</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>75%</th>\n",
+              "      <td>2.007500e+06</td>\n",
+              "      <td>13.700000</td>\n",
+              "      <td>19.156750</td>\n",
+              "      <td>0.283000</td>\n",
+              "      <td>0.375000</td>\n",
+              "      <td>0.187092</td>\n",
+              "      <td>10.918176</td>\n",
+              "      <td>636.010802</td>\n",
+              "      <td>3.005449</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>max</th>\n",
+              "      <td>2.009999e+06</td>\n",
+              "      <td>20.700000</td>\n",
+              "      <td>939.400000</td>\n",
+              "      <td>1.000000</td>\n",
+              "      <td>39.297000</td>\n",
+              "      <td>0.889831</td>\n",
+              "      <td>68.018875</td>\n",
+              "      <td>4241.524913</td>\n",
+              "      <td>24.667968</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-d16cf806-a3e2-46d9-973d-74448570aaa2')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-d16cf806-a3e2-46d9-973d-74448570aaa2 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-d16cf806-a3e2-46d9-973d-74448570aaa2');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 9
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before applying any transformations, we need to check whether every column is actually needed for model training. Let's first look at the column descriptions provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** absolute magnitude parameter H, the visual magnitude the object would have if it were 1 au from both the Sun and the observer, at zero phase angle.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km\n",
+        "* **albedo:** the fraction of incoming solar radiation that the object reflects diffusely, on a scale from 0 to 1\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in object diameter, in km\n",
+        "* **eccentricity:** value between 0 and 1 that describes how circular (close to 0) or elongated (close to 1) the asteroid's orbit is\n",
+        "* **inclination:** angle of the orbit with respect to the x-y ecliptic plane, in degrees\n",
+        "* **moid_ld:** Earth Minimum Orbit Intersection Distance (MOID), in lunar distances (LD)\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** half the length of the orbit's long axis, in au\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. They identify an asteroid rather than describe it, so they can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's have a look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_02cba1067de8e024374a26297834d233\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most of the columns do not have missing values. However, columns **'diameter'**, **'albedo'** and **'diameter_sigma'** are each missing roughly 13-14% of their values. Because these missing values cannot be measured or derived, we drop the three columns; they are not required for training the machine learning model."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "Next, we need to normalize the numerical columns before using them to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation (a.k.a. [z-score](https://developers.google.com/machine-learning/data-prep/transform/normalization#z-score)). This improves the performance and stability of the model during training and inference.\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical columns and the categorical columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = [col for col in beam_df.columns if col not in numerical_cols]  # keeps the original column order"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 356
+        },
+        "id": "Gjc0UlDD-xUn",
+        "outputId": "cadc4402-7edc-43f6-bc6a-bc4f1fd3314a"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "error",
+          "ename": "NotImplementedError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mNotImplementedError\u001b[0m                       Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/987840581.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Normalizing method_1: Can work but relies on ticket\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frame_base.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    426\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    427\u001b[0m     raise NotImplementedError(\n\u001b[0;32m--> 428\u001b[0;31m         \u001b[0;34mf\"{op!r} is not implemented yet. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    429\u001b[0m         \u001b[0;34mf\"If support for {op!r} is important to you, please let the Beam \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    430\u001b[0m         \u001b[0;34m\"community know by writing to user@beam.apache.org \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\n",
+            "\u001b[0;31mNotImplementedError\u001b[0m: 'loc.setitem' is not implemented yet. If support for 'loc.setitem' is important to you, please let the Beam community know by writing to user@beam.apache.org (see https://beam.apache.org/community/contact-us/) or commenting on https://github.com/apache/beam/issues/20318"
+          ]
+        }
+      ],
+      "source": [
+        "# Normalizing method_1: Can work but relies on ticket #22267\n",
+        "beam_df.loc[:, numerical_cols] = (beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean()) / beam_df.loc[:, numerical_cols].std()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
+        "Normalizing the data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "be40308f-d27c-46fa-e365-43a635addf6b"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>306</th>\n",
+              "      <td>-1.570727</td>\n",
+              "      <td>-0.062543</td>\n",
+              "      <td>-0.278518</td>\n",
+              "      <td>0.373194</td>\n",
+              "      <td>0.357201</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>310</th>\n",
+              "      <td>-1.631718</td>\n",
+              "      <td>-1.724526</td>\n",
+              "      <td>-0.736389</td>\n",
+              "      <td>1.087833</td>\n",
+              "      <td>0.344233</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>546</th>\n",
+              "      <td>-1.753698</td>\n",
+              "      <td>1.028793</td>\n",
+              "      <td>1.415303</td>\n",
+              "      <td>-0.339489</td>\n",
+              "      <td>0.139080</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>635</th>\n",
+              "      <td>-1.875678</td>\n",
+              "      <td>0.244869</td>\n",
+              "      <td>0.005905</td>\n",
+              "      <td>0.214107</td>\n",
+              "      <td>0.367559</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>701</th>\n",
+              "      <td>-3.278451</td>\n",
+              "      <td>-1.570523</td>\n",
+              "      <td>2.006145</td>\n",
+              "      <td>1.542754</td>\n",
+              "      <td>0.829337</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9697</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-1.151809</td>\n",
+              "      <td>-0.082944</td>\n",
+              "      <td>-0.129556</td>\n",
+              "      <td>-0.533538</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9813</th>\n",
+              "      <td>1.722740</td>\n",
+              "      <td>0.844551</td>\n",
+              "      <td>-0.583247</td>\n",
+              "      <td>-1.006447</td>\n",
+              "      <td>-0.677961</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9868</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-0.207399</td>\n",
+              "      <td>-0.784665</td>\n",
+              "      <td>-0.462136</td>\n",
+              "      <td>-0.539794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9903</th>\n",
+              "      <td>0.868878</td>\n",
+              "      <td>0.460086</td>\n",
+              "      <td>0.092258</td>\n",
+              "      <td>-0.107597</td>\n",
+              "      <td>0.071794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9956</th>\n",
+              "      <td>0.746898</td>\n",
+              "      <td>-0.234132</td>\n",
+              "      <td>-0.161116</td>\n",
+              "      <td>-0.601379</td>\n",
+              "      <td>-0.664887</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 5 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_numerical_cols = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize the numerical columns (zero mean, unit variance)\n",
+        "beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean())/beam_numerical_cols.std()\n",
+        "\n",
+        "ib.collect(beam_numerical_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to convert the categorical columns into one-hot encoded variables so that they can be used during training.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "Ngoxg0rSywVd",
+        "outputId": "d81bb29a-f8f8-4186-b186-5ff85667dbec"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1671644751.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'object_class'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'get_dummies'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col= beam_df.filter(items=['object_class'])\n",
+        "object_class_col.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "hz8s7z8caTq-",
+        "outputId": "5c543a6b-0ea1-41f8-afec-9691bbbd1f5b"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1927971370.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;31m# df['categories_concat'].str.get_dummies('-')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'str'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col.str.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "rVdSIyCB0spw"
+      },
+      "source": [
+        "# Putting it all together\n",
+        "\n",
+        "Let's now combine all the steps above into a single pipeline and visualize the preprocessed data.\n",
+        "\n",
+        "> ℹ️ Note that the only standard Beam construct used here is the `Pipeline` instance. The remaining preprocessing steps all use native pandas methods that are supported by the Beam DataFrame API."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "ndaSNond0v8Q",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 498
+        },
+        "outputId": "0155d359-45c9-4345-e1b6-b1881408f049"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "error",
+          "ename": "ValueError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1408061827.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m     25\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     26\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 27\u001b[0;31m \u001b[0mib\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcollect\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpreprocessed_dataset\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/utils.py\u001b[0m in \u001b[0;36mrun_within_progress_indicator\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    275\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mrun_within_progress_indicator\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    276\u001b[0m     \u001b[0;32mwith\u001b[0m \u001b[0mProgressIndicator\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34mf'Processing... {func.__name__}'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Done.'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 277\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001
 b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    278\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    279\u001b[0m   \u001b[0;32mreturn\u001b[0m \u001b[0mrun_within_progress_indicator\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/interactive_beam.py\u001b[0m in \u001b[0;36mcollect\u001b[0;34m(pcoll, n, duration, include_window_info)\u001b[0m\n\u001b[1;32m    945\u001b[0m         element_type=element_type)\n\u001b[1;32m    946\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 947\u001b[0;31m   \u001b[0mrecording\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mrecording_manager\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrecord\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mpcoll\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmax_n\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mn\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmax_duration\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mduration\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    948\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    949\u001b[0m   \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0
 ;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/recording_manager.py\u001b[0m in \u001b[0;36mrecord\u001b[0;34m(self, pcolls, max_n, max_duration)\u001b[0m\n\u001b[1;32m    459\u001b[0m       pf.PipelineFragment(\n\u001b[1;32m    460\u001b[0m           \u001b[0mlist\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0muncomputed_pcolls\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 461\u001b[0;31m           self.user_pipeline.options).run(blocking=is_remote_run)\n\u001b[0m\u001b[1;32m    462\u001b[0m       \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mie\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcurrent_env\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpipeline_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0muser_pipeline\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n
 \u001b[1;32m    463\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py\u001b[0m in \u001b[0;36mrun\u001b[0;34m(self, display_pipeline_graph, use_cache, blocking)\u001b[0m\n\u001b[1;32m    111\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_force_compute\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0muse_cache\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    112\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_blocking\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mblocking\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 113\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdeduce_fragment\u001
 b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    114\u001b[0m     \u001b[0;32mfinally\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    115\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_skip_display\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpreserved_skip_display\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py\u001b[0m in \u001b[0;36mdeduce_fragment\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m     98\u001b[0m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mto_runner_api\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     99\u001b[0m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 100\u001b[0;31m         self._options)\n\u001b[0m\u001b[1;32m    101\u001b[0m     \u001b[0mie\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcurrent_env\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0madd_derived_pipeline\u001b[0m\u001b[0;34m(\u001b[0m\u001b[
 0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfragment\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    102\u001b[0m     \u001b[0;32mreturn\u001b[0m \u001b[0mfragment\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/pipeline.py\u001b[0m in \u001b[0;36mfrom_runner_api\u001b[0;34m(proto, runner, options, return_context)\u001b[0m\n\u001b[1;32m    990\u001b[0m       \u001b[0mpcollection\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpipeline\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mp\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    991\u001b[0m       \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mpcollection\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mproducer\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 992\u001b[0;31m         \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'No producer for %s'\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0mid\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    993\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    994\u001b[0m 
     \u001b[0;31m# Inject PBegin input where necessary.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mValueError\u001b[0m: No producer for ref_PCollection_PCollection_265"
+          ]
+        }
+      ],
+      "source": [
+        "# Specify the location of the source CSV file to be processed\n",
+        "source_csv_file = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize the pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our CSV file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(source_csv_file, splittable=True)\n",
+        "\n",
+        "# Drop irrelevant columns and columns with missing values\n",
+        "beam_df = beam_df.drop(['spk_id', 'full_name','diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "\n",
+        "# Get numerical columns/columns with categorical variables\n",
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))\n",
+        "\n",
+        "# Normalize the numerical variables \n",
+        "beam_df_numerical = beam_df.filter(items=numerical_cols)\n",
+        "beam_df_numerical = (beam_df_numerical - beam_df_numerical.mean())/beam_df_numerical.std()\n",
+        "\n",
+        "\n",
+        "# One-hot encode the categorical variables \n",
+        "beam_df_categorical = beam_df.filter(items=categorical_cols)\n",
+        "# TODO: add the one-hot encoding step\n",
+        "\n",
+        "# Merge the normalized variables with the one-hot encoded variables\n",
+        "preprocessed_dataset = beam_df_categorical.merge(beam_df_numerical, left_index = True, right_index = True)\n",
+        "\n",
+        "ib.collect(preprocessed_dataset)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xZvJTqa3XKI_"
+      },
+      "source": [
+        "# Part II : Process the full dataset with the Distributed Runner\n",
+        "Now that we've shown how to build and execute the pipeline locally using the Interactive Runner, it's time to execute our pipeline on our full dataset by switching to a distributed runner. For this example, we will execute our pipeline on [Dataflow](https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline)."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "PROJECT_ID = \"<my-gcp-project>\"\n",
+        "REGION = \"us-west1\"\n",
+        "TEMP_DIR = \"gs://<my-bucket>/tmp\"\n",
+        "OUTPUT_DIR = \"gs://<my-bucket>/dataframe-result\""
+      ],
+      "metadata": {
+        "id": "dDBYbMEWbL4t"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "> ℹ️ Note that we are now processing the full dataset `full.csv`, which contains approximately 1 million rows. We're also writing the results to a `csv` file instead of using `ib.collect()` to materialize the deferred dataframe.\n",
+        "\n",
+        "> ℹ️ The only things we need to change to switch from the interactive runner to a distributed one are the pipeline options. The rest of the pipeline steps are identical."
+      ],
+      "metadata": {
+        "id": "Qk1GaYoSc9-1"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Specify the location of source csv file to be processed (full dataset)\n",
+        "source_csv_file = 'gs://apache-beam-samples/nasa_jpl_asteroid/full.csv'\n",
+        "\n",
+        "# Build a new pipeline that will execute on Dataflow.\n",
+        "p = beam.Pipeline(DataflowRunner(),\n",
+        "                  options=beam.options.pipeline_options.PipelineOptions(\n",
+        "                      project=PROJECT_ID,\n",
+        "                      region=REGION,\n",
+        "                      temp_location=TEMP_DIR,\n",
+        "                      # Disable autoscaling for a quicker demo\n",
+        "                      autoscaling_algorithm='NONE',\n",
+        "                      num_workers=10))\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(source_csv_file, splittable=True)\n",
+        "\n",
+        "# Drop irrelevant columns/columns with missing values\n",
+        "beam_df = beam_df.drop(['spk_id', 'full_name','diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "\n",
+        "# Get numerical columns/columns with categorical variables\n",
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))\n",
+        "\n",
+        "# Normalize the numerical variables \n",
+        "beam_df_numerical = beam_df.filter(items=numerical_cols)\n",
+        "beam_df_numerical = (beam_df_numerical - beam_df_numerical.mean())/beam_df_numerical.std()\n",
+        "\n",
+        "# One-hot encode the categorical variables \n",
+        "beam_df_categorical = beam_df.filter(items=categorical_cols)\n",
+        "# TODO: one-hot encoding step\n",
+        "\n",
+        "# Merge the normalized variables with the one-hot encoded variables  (Optional)\n",
+        "preprocessed_dataset = beam_df_categorical.merge(beam_df_numerical, left_index = True, right_index = True)\n",
+        "\n",
+        "# Write the pre-processed dataset to csv\n",
+        "preprocessed_dataset.to_csv(os.path.join(OUTPUT_DIR, \"preprocessed_data.csv\"))"
+      ],
+      "metadata": {
+        "id": "1XovR0gKbMlK"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Let's now submit and execute our pipeline."
+      ],
+      "metadata": {
+        "id": "a789u4Yecs_g"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "p.run().wait_until_finish()"
+      ],
+      "metadata": {
+        "id": "pbUlC102bPaZ"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "The pipeline job takes some time to finish."
+      ],
+      "metadata": {
+        "id": "dzdqmzKzTOng"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# What's next \n",
+        "\n",
+        "Now that we've seen how to analyze and preprocess a large-scale dataset with the Beam DataFrames API, we can train a classification model on our preprocessed dataset.  \n",
+        "\n",
+        "To learn more on how to get started with classifying structured data, refer to:\n",
+        "\n",
+        "*   [Classify structured data with feature columns](https://www.tensorflow.org/tutorials/structured_data/feature_columns)\n",

Review Comment:
   Good to know. I'll remove the reference 





[GitHub] [beam] TheNeuralBit commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r959797852


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. It provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by only invoking standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to quickly test our pipeline locally before running it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we will only be working with a subset of the original dataset since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam function returns a deferred Beam DataFrame, while the pandas function returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, sample] where\n",
+        "# sample contains all of the dataset (around 1000000 samples)\n",
+        "file_location = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize the pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(file_location, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. A distance of 70,000 km might not seem potentially harmful, but at an astronomical scale it is very small and can disrupt many natural phenomena.\n",
+        "\n",
+        "These objects/asteroids can thus prove to be harmful, so it is wise to know what surrounds us and which of those objects can harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types."
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "c24ff83d-3a13-47a6-c9c2-3978729fde82"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 4
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a Pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 746
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "14fa80de-2dee-4963-99d8-3e321f949ff8"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a986c6cc61ed5a5b622e163d92f73775\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "       spk_id                   full_name near_earth_object  \\\n",
+              "0     2000001                     1 Ceres                 N   \n",
+              "1     2000002                    2 Pallas                 N   \n",
+              "2     2000003                      3 Juno                 N   \n",
+              "3     2000004                     4 Vesta                 N   \n",
+              "4     2000005                   5 Astraea                 N   \n",
+              "...       ...                         ...               ...   \n",
+              "9994  2009995    9995 Alouette (4805 P-L)                 N   \n",
+              "9995  2009996         9996 ANS (9070 P-L)                 N   \n",
+              "9996  2009997        9997 COBE (1217 T-1)                 N   \n",
+              "9997  2009998         9998 ISO (1293 T-1)                 N   \n",
+              "9998  2009999       9999 Wiles (4196 T-2)                 N   \n",
+              "\n",
+              "      absolute_magnitude  diameter  albedo  diameter_sigma  eccentricity  \\\n",
+              "0                   3.40   939.400  0.0900           0.200      0.076009   \n",
+              "1                   4.20   545.000  0.1010          18.000      0.229972   \n",
+              "2                   5.33   246.596  0.2140          10.594      0.256936   \n",
+              "3                   3.00   525.400  0.4228           0.200      0.088721   \n",
+              "4                   6.90   106.699  0.2740           3.140      0.190913   \n",
+              "...                  ...       ...     ...             ...           ...   \n",
+              "9994               15.10     2.564  0.2450           0.550      0.160610   \n",
+              "9995               13.60     8.978  0.1130           0.376      0.235174   \n",
+              "9996               14.30       NaN     NaN             NaN      0.113059   \n",
+              "9997               15.10     2.235  0.3880           0.373      0.093852   \n",
+              "9998               13.00     7.148  0.2620           0.065      0.071351   \n",
+              "\n",
+              "      inclination     moid_ld object_class  semi_major_axis_au_unit  \\\n",
+              "0       10.594067  620.640533          MBA                 2.769165   \n",
+              "1       34.832932  480.348639          MBA                 2.773841   \n",
+              "2       12.991043  402.514639          MBA                 2.668285   \n",
+              "3        7.141771  443.451432          MBA                 2.361418   \n",
+              "4        5.367427  426.433027          MBA                 2.574037   \n",
+              "...           ...         ...          ...                      ...   \n",
+              "9994     2.311731  388.723233          MBA                 2.390249   \n",
+              "9995     7.657713  444.194746          MBA                 2.796605   \n",
+              "9996     2.459643  495.460110          MBA                 2.545674   \n",
+              "9997     3.912263  373.848377          MBA                 2.160961   \n",
+              "9998     3.198839  632.144398          MBA                 2.839917   \n",
+              "\n",
+              "     hazardous_flag  \n",
+              "0                 N  \n",
+              "1                 N  \n",
+              "2                 N  \n",
+              "3                 N  \n",
+              "4                 N  \n",
+              "...             ...  \n",
+              "9994              N  \n",
+              "9995              N  \n",
+              "9996              N  \n",
+              "9997              N  \n",
+              "9998              N  \n",
+              "\n",
+              "[9999 rows x 13 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-0161aa69-d50f-4d6f-84c1-10dacb278880\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>full_name</th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>2000001</td>\n",
+              "      <td>1 Ceres</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>939.400</td>\n",
+              "      <td>0.0900</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>2000002</td>\n",
+              "      <td>2 Pallas</td>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>545.000</td>\n",
+              "      <td>0.1010</td>\n",
+              "      <td>18.000</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>2000003</td>\n",
+              "      <td>3 Juno</td>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>246.596</td>\n",
+              "      <td>0.2140</td>\n",
+              "      <td>10.594</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>2000004</td>\n",
+              "      <td>4 Vesta</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>525.400</td>\n",
+              "      <td>0.4228</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>2000005</td>\n",
+              "      <td>5 Astraea</td>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>106.699</td>\n",
+              "      <td>0.2740</td>\n",
+              "      <td>3.140</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>2009995</td>\n",
+              "      <td>9995 Alouette (4805 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.564</td>\n",
+              "      <td>0.2450</td>\n",
+              "      <td>0.550</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>2009996</td>\n",
+              "      <td>9996 ANS (9070 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>8.978</td>\n",
+              "      <td>0.1130</td>\n",
+              "      <td>0.376</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>2009997</td>\n",
+              "      <td>9997 COBE (1217 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>2009998</td>\n",
+              "      <td>9998 ISO (1293 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.235</td>\n",
+              "      <td>0.3880</td>\n",
+              "      <td>0.373</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>2009999</td>\n",
+              "      <td>9999 Wiles (4196 T-2)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>7.148</td>\n",
+              "      <td>0.2620</td>\n",
+              "      <td>0.065</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 13 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-0161aa69-d50f-4d6f-84c1-10dacb278880')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 5
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** These columns need to be transformed with [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) before they can be used for training.\n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
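
To make the one-hot encoding step concrete, here is a minimal sketch in plain Python. It is not the Beam DataFrame API (the PR description lists one-hot encoding as still open in #22268); the helper name `one_hot_encode` and the sample class labels are assumptions chosen for illustration only.

```python
def one_hot_encode(values):
    """Return (categories, rows) with one 0/1 indicator per distinct category.

    Hypothetical helper for illustration; not part of Beam or pandas.
    """
    categories = sorted(set(values))
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, rows


# Illustrative values for a categorical column such as 'object_class'.
object_class = ["MBA", "APO", "MBA", "AMO"]
categories, encoded = one_hot_encode(object_class)

print(categories)
for label, row in zip(object_class, encoded):
    print(label, row)
```

Each categorical value becomes a row with exactly one 1, so a model never sees an artificial ordering between categories.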
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check whether all the columns can be used for model training. Let's first look at the column descriptions as provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID.\n",
+        "* **full_name:** Asteroid name.\n",
+        "* **near_earth_object:** Near-Earth object flag.\n",
+        "* **absolute_magnitude:** The magnitude the asteroid would have if it were located 1 au from both the Sun and the observer, at zero phase angle.\n",
+        "* **diameter:** Object diameter (from equivalent sphere), in km.\n",
+        "* **albedo:** The fraction of incoming solar radiation that the object reflects diffusely, measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in the object diameter, in km.\n",
+        "* **eccentricity:** A value between 0 and 1 that describes how flat or round the asteroid's orbit is (0 is a perfect circle).\n",
+        "* **inclination:** Angle of the orbit with respect to the x-y ecliptic plane.\n",
+        "* **moid_ld:** Earth minimum orbit intersection distance (MOID), in lunar distances (LD).\n",
+        "* **object_class:** The classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** Half the length of the long axis of the orbit, in au.\n",
+        "* **hazardous_flag:** Hazardous asteroid flag."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. These columns can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's have a look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_02cba1067de8e024374a26297834d233\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most columns have no missing values. Columns **'diameter'**, **'albedo'** and **'diameter_sigma'** each have roughly 13-14% missing values. Since these values cannot be measured or derived, we can remove these columns; they will not be used for machine learning model training."
+      ]
+    },
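
The missing-value check above, `isnull().mean() * 100`, is the share of rows in which a column is empty, expressed as a percentage. The sketch below spells out that arithmetic in plain Python on a made-up four-row table. The helper names, the toy values, and the 10% cut-off are all assumptions for illustration; the notebook itself drops the three columns by name rather than by threshold.

```python
def missing_percentage(rows, columns):
    """Percentage of rows in which each column is missing (None)."""
    total = len(rows)
    return {
        col: 100.0 * sum(1 for row in rows if row.get(col) is None) / total
        for col in columns
    }


def drop_sparse_columns(rows, percentages, threshold):
    """Keep only the columns whose missing percentage is at most threshold."""
    keep = [col for col, pct in percentages.items() if pct <= threshold]
    return [{col: row[col] for col in keep} for row in rows]


# Toy rows with made-up values; None marks a missing entry.
rows = [
    {"diameter": 1.0, "albedo": 0.1, "eccentricity": 0.2},
    {"diameter": None, "albedo": 0.2, "eccentricity": 0.1},
    {"diameter": None, "albedo": None, "eccentricity": 0.3},
    {"diameter": 2.0, "albedo": 0.3, "eccentricity": 0.4},
]

percentages = missing_percentage(rows, ["diameter", "albedo", "eccentricity"])
cleaned = drop_sparse_columns(rows, percentages, threshold=10.0)

print(percentages)
print(cleaned[0])
```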
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "The numerical columns need to be normalized before using them to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation. This puts all the columns on the same scale, so that they are weighted equally during training."
+      ]
+    },
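
As a worked example of the standardization described above, the following sketch applies "subtract the mean, divide by the standard deviation" to a single column using only the Python standard library. The function name `standardize` and the sample values are assumptions for illustration. `statistics.stdev` computes the sample standard deviation, which matches the default (`ddof=1`) of pandas `std()`.

```python
import statistics


def standardize(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample std (n - 1 in the denominator)
    return [(value - mean) / std for value in values]


# Made-up column values chosen so the arithmetic is easy to follow:
# mean = 4.0, sample std = 2.0.
column = [2.0, 4.0, 6.0]
scaled = standardize(column)

print(scaled)
```

After standardization the column has mean 0 and sample standard deviation 1, regardless of its original units.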
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical columns and the categorical columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 356
+        },
+        "id": "Gjc0UlDD-xUn",
+        "outputId": "cadc4402-7edc-43f6-bc6a-bc4f1fd3314a"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "error",
+          "ename": "NotImplementedError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mNotImplementedError\u001b[0m                       Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/987840581.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Normalizing method_1: Can work but relies on ticket\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frame_base.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    426\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    427\u001b[0m     raise NotImplementedError(\n\u001b[0;32m--> 428\u001b[0;31m         \u001b[0;34mf\"{op!r} is not implemented yet. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    429\u001b[0m         \u001b[0;34mf\"If support for {op!r} is important to you, please let the Beam \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    430\u001b[0m         \u001b[0;34m\"community know by writing to user@beam.apache.org \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\n",
+            "\u001b[0;31mNotImplementedError\u001b[0m: 'loc.setitem' is not implemented yet. If support for 'loc.setitem' is important to you, please let the Beam community know by writing to user@beam.apache.org (see https://beam.apache.org/community/contact-us/) or commenting on https://github.com/apache/beam/issues/20318"
+          ]
+        }
+      ],
+      "source": [
+        "# Normalizing method_1: Can work but relies on ticket #22267\n",
+        "beam_df.loc[:,numerical_cols] = (beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean())/beam_numerical_cols.std()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
+        "Normalizing the data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "be40308f-d27c-46fa-e365-43a635addf6b"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_numerical_cols = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize only the numerical columns\n",
+        "beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean())/beam_numerical_cols.std()\n",
+        "\n",
+        "ib.collect(beam_numerical_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to convert the categorical columns into one-hot encoded variables to use them during training.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "Ngoxg0rSywVd",
+        "outputId": "d81bb29a-f8f8-4186-b186-5ff85667dbec"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "---------------------------------------------------------------------------",
+            "AttributeError                            Traceback (most recent call last)",
+            "/tmp/ipykernel_325/1671644751.py in <module>\n      4 \n      5 object_class_col= beam_df.filter(items=['object_class'])\n----> 6 object_class_col.get_dummies()\n",
+            "/content/beam/sdks/python/apache_beam/dataframe/frames.py in __getattr__(self, name)\n   2482       return self[name]\n   2483     else:\n-> 2484       return object.__getattribute__(self, name)\n   2485 \n   2486   def __getitem__(self, key):\n",
+            "AttributeError: 'DeferredDataFrame' object has no attribute 'get_dummies'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col= beam_df.filter(items=['object_class'])\n",
+        "object_class_col.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "hz8s7z8caTq-",
+        "outputId": "5c543a6b-0ea1-41f8-afec-9691bbbd1f5b"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "---------------------------------------------------------------------------",
+            "AttributeError                            Traceback (most recent call last)",
+            "/tmp/ipykernel_325/1927971370.py in <module>\n      8 # df['categories_concat'].str.get_dummies('-')\n      9 \n---> 10 object_class_col.str.get_dummies()\n",
+            "/content/beam/sdks/python/apache_beam/dataframe/frames.py in __getattr__(self, name)\n   2482       return self[name]\n   2483     else:\n-> 2484       return object.__getattribute__(self, name)\n   2485 \n   2486   def __getitem__(self, key):\n",
+            "AttributeError: 'DeferredDataFrame' object has no attribute 'str'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col.str.get_dummies()"

Review Comment:
   Did you try `unique` on just the series that you are trying to one-hot encode? Note that even with `pd.get_dummies`, the user will have to create a `CategoricalDtype` somehow, which will likely rely on `unique`, so we need to get that working regardless.
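
For reference, the flow described in the comment above can be sketched in plain pandas. This is only an illustration under stated assumptions: it uses an in-memory `pd.Series` rather than a Beam deferred series, and the class labels (`APO`, `AMO`, `ATE`) are hypothetical sample values, not taken from the notebook's dataset. It shows why the distinct values have to be known first: once they are pinned in a `CategoricalDtype`, the encoded frame has the same columns no matter which rows are present.

```python
import pandas as pd

# Hypothetical sample values standing in for the notebook's `object_class` column.
object_class = pd.Series(["APO", "AMO", "APO", "ATE"], name="object_class")

# Step 1: find the distinct categories. This is the role `unique` plays in
# the review comment; sorting just makes the column order deterministic.
categories = sorted(object_class.unique())

# Step 2: pin the categories in a CategoricalDtype so that the set of
# output columns is fixed before any encoding happens.
class_dtype = pd.CategoricalDtype(categories=categories)

# Step 3: one-hot encode the column cast to that dtype.
one_hot = pd.get_dummies(object_class.astype(class_dtype))

# Encoding only part of the data still yields every category as a column,
# because the columns come from the dtype, not from the rows that are present.
partial = pd.get_dummies(object_class.head(2).astype(class_dtype))
```

Whether the same calls work unchanged on a `DeferredDataFrame` depends on the Beam version and is not something this sketch demonstrates.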





[GitHub] [beam] PhilippeMoussalli commented on pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
PhilippeMoussalli commented on PR #22587:
URL: https://github.com/apache/beam/pull/22587#issuecomment-1218220948

   > > Have only one installation script for Beam with the latest implemented functions in the Dataframe API instead of installing from source
   > 
   > This is just blocked on the 2.41.0 release, right?
   
   I am aware of the particular release date of 2.41.0. I suppose it will depend on whether we manage to resolve all the friction points before that release.




[GitHub] [beam] damccorm commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
damccorm commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r1013020478


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,3496 @@
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "source": [
+        "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License."
+      ],
+      "metadata": {
+        "id": "sARMhsXz8yR1",
+        "cellView": "form"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like\n",
+        "API to declare and define Beam processing pipelines. It provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by only invoking standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Goal\n",
+        "The goal of this notebook is to explore a dataset and preprocess it for machine learning model training using the Beam DataFrames API.\n",
+        "\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "As we want to explore the elements within a `PCollection`, we can make use of the Interactive runner by installing Apache Beam with the `interactive` component. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.43</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "Install latest version"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C",
+        "beam:comment": "TODO(https://github.com/apache/beam/XXXX): Just install 2.43.0 once it's released, [`issue 23276`](https://github.com/apache/beam/issues/23276) is currently not implemented for Beam 2.42 (required fix for implementing `str.get_dummies()`)"

Review Comment:
   ```suggestion
           "beam:comment": "TODO(https://github.com/apache/beam/issues/23961): Just install 2.43.0 once it's released, [`issue 23276`](https://github.com/apache/beam/issues/23276) is currently not implemented for Beam 2.42 (required fix for implementing `str.get_dummies()`)"
   ```
   
   I filed an issue for this, could we reference it directly?





[GitHub] [beam] PhilippeMoussalli commented on pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
PhilippeMoussalli commented on PR #22587:
URL: https://github.com/apache/beam/pull/22587#issuecomment-1205260641

   @rezarokni @TheNeuralBit 




[GitHub] [beam] TheNeuralBit commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r973225250


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. It provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by only invoking standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to quickly test our pipeline locally before running it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we will only be working with a subset of the original dataset since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam function returns a deferred Beam DataFrame, while the pandas function returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, sample] where\n",
+        "# sample contains all of the dataset (around 1000000 samples)\n",
+        "file_location = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(file_location, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. Although a distance of 70,000 km might not seem potentially harmful, at an astronomical scale it is a very small distance and can disrupt many natural phenomena.\n",
+        "\n",
+        "These objects and asteroids can therefore prove harmful, so it is wise to know what surrounds us and which of those objects can harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types:"
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "c24ff83d-3a13-47a6-c9c2-3978729fde82"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 4
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 746
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "14fa80de-2dee-4963-99d8-3e321f949ff8"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "       spk_id                   full_name near_earth_object  \\\n",
+              "0     2000001                     1 Ceres                 N   \n",
+              "1     2000002                    2 Pallas                 N   \n",
+              "2     2000003                      3 Juno                 N   \n",
+              "3     2000004                     4 Vesta                 N   \n",
+              "4     2000005                   5 Astraea                 N   \n",
+              "...       ...                         ...               ...   \n",
+              "9994  2009995    9995 Alouette (4805 P-L)                 N   \n",
+              "9995  2009996         9996 ANS (9070 P-L)                 N   \n",
+              "9996  2009997        9997 COBE (1217 T-1)                 N   \n",
+              "9997  2009998         9998 ISO (1293 T-1)                 N   \n",
+              "9998  2009999       9999 Wiles (4196 T-2)                 N   \n",
+              "\n",
+              "      absolute_magnitude  diameter  albedo  diameter_sigma  eccentricity  \\\n",
+              "0                   3.40   939.400  0.0900           0.200      0.076009   \n",
+              "1                   4.20   545.000  0.1010          18.000      0.229972   \n",
+              "2                   5.33   246.596  0.2140          10.594      0.256936   \n",
+              "3                   3.00   525.400  0.4228           0.200      0.088721   \n",
+              "4                   6.90   106.699  0.2740           3.140      0.190913   \n",
+              "...                  ...       ...     ...             ...           ...   \n",
+              "9994               15.10     2.564  0.2450           0.550      0.160610   \n",
+              "9995               13.60     8.978  0.1130           0.376      0.235174   \n",
+              "9996               14.30       NaN     NaN             NaN      0.113059   \n",
+              "9997               15.10     2.235  0.3880           0.373      0.093852   \n",
+              "9998               13.00     7.148  0.2620           0.065      0.071351   \n",
+              "\n",
+              "      inclination     moid_ld object_class  semi_major_axis_au_unit  \\\n",
+              "0       10.594067  620.640533          MBA                 2.769165   \n",
+              "1       34.832932  480.348639          MBA                 2.773841   \n",
+              "2       12.991043  402.514639          MBA                 2.668285   \n",
+              "3        7.141771  443.451432          MBA                 2.361418   \n",
+              "4        5.367427  426.433027          MBA                 2.574037   \n",
+              "...           ...         ...          ...                      ...   \n",
+              "9994     2.311731  388.723233          MBA                 2.390249   \n",
+              "9995     7.657713  444.194746          MBA                 2.796605   \n",
+              "9996     2.459643  495.460110          MBA                 2.545674   \n",
+              "9997     3.912263  373.848377          MBA                 2.160961   \n",
+              "9998     3.198839  632.144398          MBA                 2.839917   \n",
+              "\n",
+              "     hazardous_flag  \n",
+              "0                 N  \n",
+              "1                 N  \n",
+              "2                 N  \n",
+              "3                 N  \n",
+              "4                 N  \n",
+              "...             ...  \n",
+              "9994              N  \n",
+              "9995              N  \n",
+              "9996              N  \n",
+              "9997              N  \n",
+              "9998              N  \n",
+              "\n",
+              "[9999 rows x 13 columns]"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 5
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check if all the columns can be used for model training. Let's first have a look at the column descriptions as provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the apparent magnitude the object would have if it were 1 au from both the Sun and the observer at zero phase angle.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km.\n",
+        "* **albedo:** a measure of the diffuse reflection of solar radiation out of the total solar radiation, measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in object diameter, in km.\n",
+        "* **eccentricity:** value between 0 and 1 that refers to how flat or round the orbit of the asteroid is.\n",
+        "* **inclination:** angle of the orbit with respect to the x-y ecliptic plane.\n",
+        "* **moid_ld:** Earth Minimum Orbit Intersection Distance, in lunar distance (LD) units.\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** half the length of the long axis of the orbit, in au.\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. These columns can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's have a look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most columns have no missing values. The columns **'diameter'**, **'albedo'** and **'diameter_sigma'** are each missing in roughly 13-14% of rows. Since these values cannot be measured or derived, we remove the columns; they will not be required for training the machine learning model."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "The numerical columns need to be normalized before using them to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation. After this step each column has a mean of 0 and a standard deviation of 1, so all columns are on the same scale and are weighted equally during training."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical columns and the categorical columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 356
+        },
+        "id": "Gjc0UlDD-xUn",
+        "outputId": "cadc4402-7edc-43f6-bc6a-bc4f1fd3314a"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "error",
+          "ename": "NotImplementedError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mNotImplementedError\u001b[0m                       Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/987840581.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Normalizing method_1: Can work but relies on ticket\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frame_base.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    426\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    427\u001b[0m     raise NotImplementedError(\n\u001b[0;32m--> 428\u001b[0;31m         \u001b[0;34mf\"{op!r} is not implemented yet. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    429\u001b[0m         \u001b[0;34mf\"If support for {op!r} is important to you, please let the Beam \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    430\u001b[0m         \u001b[0;34m\"community know by writing to user@beam.apache.org \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mNotImplementedError\u001b[0m: 'loc.setitem' is not implemented yet. If support for 'loc.setitem' is important to you, please let the Beam community know by writing to user@beam.apache.org (see https://beam.apache.org/community/contact-us/) or commenting on https://github.com/apache/beam/issues/20318"
+          ]
+        }
+      ],
+      "source": [
+        "# Normalizing method_1: raises NotImplementedError until 'loc.setitem' is supported (https://github.com/apache/beam/issues/20318)\n",
+        "beam_df.loc[:, numerical_cols] = (beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean()) / beam_df.loc[:, numerical_cols].std()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
+        "Normalizing the data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "be40308f-d27c-46fa-e365-43a635addf6b"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>306</th>\n",
+              "      <td>-1.570727</td>\n",
+              "      <td>-0.062543</td>\n",
+              "      <td>-0.278518</td>\n",
+              "      <td>0.373194</td>\n",
+              "      <td>0.357201</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>310</th>\n",
+              "      <td>-1.631718</td>\n",
+              "      <td>-1.724526</td>\n",
+              "      <td>-0.736389</td>\n",
+              "      <td>1.087833</td>\n",
+              "      <td>0.344233</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>546</th>\n",
+              "      <td>-1.753698</td>\n",
+              "      <td>1.028793</td>\n",
+              "      <td>1.415303</td>\n",
+              "      <td>-0.339489</td>\n",
+              "      <td>0.139080</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>635</th>\n",
+              "      <td>-1.875678</td>\n",
+              "      <td>0.244869</td>\n",
+              "      <td>0.005905</td>\n",
+              "      <td>0.214107</td>\n",
+              "      <td>0.367559</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>701</th>\n",
+              "      <td>-3.278451</td>\n",
+              "      <td>-1.570523</td>\n",
+              "      <td>2.006145</td>\n",
+              "      <td>1.542754</td>\n",
+              "      <td>0.829337</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9697</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-1.151809</td>\n",
+              "      <td>-0.082944</td>\n",
+              "      <td>-0.129556</td>\n",
+              "      <td>-0.533538</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9813</th>\n",
+              "      <td>1.722740</td>\n",
+              "      <td>0.844551</td>\n",
+              "      <td>-0.583247</td>\n",
+              "      <td>-1.006447</td>\n",
+              "      <td>-0.677961</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9868</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-0.207399</td>\n",
+              "      <td>-0.784665</td>\n",
+              "      <td>-0.462136</td>\n",
+              "      <td>-0.539794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9903</th>\n",
+              "      <td>0.868878</td>\n",
+              "      <td>0.460086</td>\n",
+              "      <td>0.092258</td>\n",
+              "      <td>-0.107597</td>\n",
+              "      <td>0.071794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9956</th>\n",
+              "      <td>0.746898</td>\n",
+              "      <td>-0.234132</td>\n",
+              "      <td>-0.161116</td>\n",
+              "      <td>-0.601379</td>\n",
+              "      <td>-0.664887</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 5 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_numerical_cols = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize dataframes only with numerical columns\n",
+        "beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean())/beam_numerical_cols.std()\n",
+        "\n",
+        "ib.collect(beam_numerical_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to convert the categorical columns into one-hot encoded variables to use them during training.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "Ngoxg0rSywVd",
+        "outputId": "d81bb29a-f8f8-4186-b186-5ff85667dbec"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1671644751.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'object_class'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'get_dummies'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col= beam_df.filter(items=['object_class'])\n",
+        "object_class_col.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "hz8s7z8caTq-",
+        "outputId": "5c543a6b-0ea1-41f8-afec-9691bbbd1f5b"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1927971370.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;31m# df['categories_concat'].str.get_dummies('-')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'str'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col.str.get_dummies()"

Review Comment:
   Filed #23276 to track this bug.
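
For context on the two failing cells quoted above, here is a minimal pandas-only sketch (no Beam involved). In plain pandas, `get_dummies` is a module-level function and `.str` is a Series accessor, so neither call would work on a pandas `DataFrame` either. The sample values are made up for illustration; only the column name `object_class` comes from the notebook. Whether the deferred Beam equivalents behave the same way is not shown here.

```python
import pandas as pd

# Made-up sample values; only the column name comes from the notebook.
df = pd.DataFrame({"object_class": ["MBA", "APO", "MBA", "AMO"]})

# pandas exposes one-hot encoding as a module-level function ...
one_hot = pd.get_dummies(df["object_class"])

# ... and as a method on the Series string accessor.
one_hot_str = df["object_class"].str.get_dummies()

# Neither name is an attribute of a DataFrame itself.
frame_has_get_dummies = hasattr(df, "get_dummies")
frame_has_str = hasattr(df, "str")
```

Selecting the column as a Series (`df["object_class"]`) rather than as a one-column frame (`df.filter(items=["object_class"])`) is what makes the `.str` accessor available in pandas.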





[GitHub] [beam] PhilippeMoussalli commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
PhilippeMoussalli commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r1005710426


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,2163 @@
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "source": [
+        "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License."
+      ],
+      "metadata": {
+        "id": "sARMhsXz8yR1",
+        "cellView": "form"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like\n",
+        "API to declare and define Beam processing pipelines. It provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by invoking only standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "As we want to explore the elements within a `PCollection`, we can make use of the Interactive runner by installing Apache Beam with the `interactive` component. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to test our code interactively, building out the pipeline as we go before deploying it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we will only be working with a subset of the original dataset since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam function returns a deferred Beam DataFrame while pandas returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 5,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam import dataframe\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, full] where\n",
+        "# full contains all of the dataset (around 1000000 samples)\n",
+        "\n",
+        "source_csv_file = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(source_csv_file, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. Although a distance of 70,000 km might not seem close enough to harm us, it is very small at an astronomical scale, and an object at that distance can disrupt many natural phenomena.\n",
+        "\n",
+        "These objects (asteroids) can therefore prove to be harmful, so it is wise to know what surrounds us and which of these objects can harm us. This dataset compiles the list of NASA-certified asteroids that are classified as the nearest Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types."
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 3,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "f5386993-14cb-42ee-94ca-8ea006860d3e"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 3
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 8,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 378
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "e49b4243-107f-4256-9e09-49cc20bf7f56"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_f6af66571c53daa0d9052370b7d1d8b7\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_f6af66571c53daa0d9052370b7d1d8b7\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_f6af66571c53daa0d9052370b7d1d8b7\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "             spk_id  absolute_magnitude     diameter       albedo  \\\n",
+              "count  9.999000e+03         9999.000000  8688.000000  8672.000000   \n",
+              "mean   2.005000e+06           12.675380    19.245446     0.197723   \n",
+              "std    2.886607e+03            1.639609    30.190191     0.138819   \n",
+              "min    2.000001e+06            3.000000     0.300000     0.008000   \n",
+              "25%    2.002500e+06           11.900000     5.614000     0.074000   \n",
+              "50%    2.005000e+06           12.900000     9.814000     0.187000   \n",
+              "75%    2.007500e+06           13.700000    19.156750     0.283000   \n",
+              "max    2.009999e+06           20.700000   939.400000     1.000000   \n",
+              "\n",
+              "       diameter_sigma  eccentricity  inclination      moid_ld  \\\n",
+              "count     8591.000000   9999.000000  9999.000000  9999.000000   \n",
+              "mean         0.454072      0.148716     7.890742   509.805237   \n",
+              "std          1.093676      0.083803     6.336244   205.046582   \n",
+              "min          0.006000      0.001003     0.042716     0.131028   \n",
+              "25%          0.120000      0.093780     3.220137   377.829197   \n",
+              "50%          0.201000      0.140335     6.018836   470.650523   \n",
+              "75%          0.375000      0.187092    10.918176   636.010802   \n",
+              "max         39.297000      0.889831    68.018875  4241.524913   \n",
+              "\n",
+              "       semi_major_axis_au_unit  \n",
+              "count              9999.000000  \n",
+              "mean                  2.689836  \n",
+              "std                   0.607190  \n",
+              "min                   0.832048  \n",
+              "25%                   2.340816  \n",
+              "50%                   2.614468  \n",
+              "75%                   3.005449  \n",
+              "max                  24.667968  "
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-2c0349a9-81c4-473a-9fa1-44c423244858\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>count</th>\n",
+              "      <td>9.999000e+03</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>8688.000000</td>\n",
+              "      <td>8672.000000</td>\n",
+              "      <td>8591.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>mean</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.675380</td>\n",
+              "      <td>19.245446</td>\n",
+              "      <td>0.197723</td>\n",
+              "      <td>0.454072</td>\n",
+              "      <td>0.148716</td>\n",
+              "      <td>7.890742</td>\n",
+              "      <td>509.805237</td>\n",
+              "      <td>2.689836</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>std</th>\n",
+              "      <td>2.886607e+03</td>\n",
+              "      <td>1.639609</td>\n",
+              "      <td>30.190191</td>\n",
+              "      <td>0.138819</td>\n",
+              "      <td>1.093676</td>\n",
+              "      <td>0.083803</td>\n",
+              "      <td>6.336244</td>\n",
+              "      <td>205.046582</td>\n",
+              "      <td>0.607190</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>min</th>\n",
+              "      <td>2.000001e+06</td>\n",
+              "      <td>3.000000</td>\n",
+              "      <td>0.300000</td>\n",
+              "      <td>0.008000</td>\n",
+              "      <td>0.006000</td>\n",
+              "      <td>0.001003</td>\n",
+              "      <td>0.042716</td>\n",
+              "      <td>0.131028</td>\n",
+              "      <td>0.832048</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>25%</th>\n",
+              "      <td>2.002500e+06</td>\n",
+              "      <td>11.900000</td>\n",
+              "      <td>5.614000</td>\n",
+              "      <td>0.074000</td>\n",
+              "      <td>0.120000</td>\n",
+              "      <td>0.093780</td>\n",
+              "      <td>3.220137</td>\n",
+              "      <td>377.829197</td>\n",
+              "      <td>2.340816</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>50%</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.900000</td>\n",
+              "      <td>9.814000</td>\n",
+              "      <td>0.187000</td>\n",
+              "      <td>0.201000</td>\n",
+              "      <td>0.140335</td>\n",
+              "      <td>6.018836</td>\n",
+              "      <td>470.650523</td>\n",
+              "      <td>2.614468</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>75%</th>\n",
+              "      <td>2.007500e+06</td>\n",
+              "      <td>13.700000</td>\n",
+              "      <td>19.156750</td>\n",
+              "      <td>0.283000</td>\n",
+              "      <td>0.375000</td>\n",
+              "      <td>0.187092</td>\n",
+              "      <td>10.918176</td>\n",
+              "      <td>636.010802</td>\n",
+              "      <td>3.005449</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>max</th>\n",
+              "      <td>2.009999e+06</td>\n",
+              "      <td>20.700000</td>\n",
+              "      <td>939.400000</td>\n",
+              "      <td>1.000000</td>\n",
+              "      <td>39.297000</td>\n",
+              "      <td>0.889831</td>\n",
+              "      <td>68.018875</td>\n",
+              "      <td>4241.524913</td>\n",
+              "      <td>24.667968</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-2c0349a9-81c4-473a-9fa1-44c423244858')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-2c0349a9-81c4-473a-9fa1-44c423244858 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-2c0349a9-81c4-473a-9fa1-44c423244858');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://developers.google.com/machine-learning/data-prep/transform/normalization) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://developers.google.com/machine-learning/data-prep/transform/transform-categorical) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
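+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "As a rough illustration of this split, the sketch below separates numerical from categorical columns by dtype. It runs on a small, made-up pandas DataFrame (`sample_df` and its values are hypothetical, not taken from the dataset), and it is only an assumption about how the split could be done, not a statement about the Beam DataFrame API."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "import pandas as pd\n",
+        "\n",
+        "# Hypothetical sample rows that reuse a few of the column names shown above.\n",
+        "sample_df = pd.DataFrame({\n",
+        "    'absolute_magnitude': [3.0, 12.9, 20.7],\n",
+        "    'eccentricity': [0.08, 0.14, 0.19],\n",
+        "    'object_class': ['MBA', 'MBA', 'APO'],\n",
+        "    'hazardous_flag': ['N', 'N', 'Y'],\n",
+        "})\n",
+        "\n",
+        "# Numerical columns are candidates for normalization.\n",
+        "numerical_cols = sample_df.select_dtypes(include='number').columns.tolist()\n",
+        "# Everything else is treated as categorical and would need one-hot encoding.\n",
+        "categorical_cols = [c for c in sample_df.columns if c not in numerical_cols]\n",
+        "\n",
+        "print(numerical_cols)\n",
+        "print(categorical_cols)"
+      ]
+    },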
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can also use the standard pandas method `DataFrame.describe()` to generate descriptive statistics for the numerical columns, such as the percentiles, mean, and standard deviation."
+      ],
+      "metadata": {
+        "id": "MGAErO0lAYws"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "with dataframe.allow_non_parallel_operations():\n",
+        "  beam_df_description = ib.collect(beam_df.describe())\n",
+        "\n",
+        "beam_df_description"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 378
+        },
+        "id": "Befv697VBGM7",
+        "outputId": "d02b7a41-a8a3-4837-cf63-e1fa9e7b011e"
+      },
+      "execution_count": 9,
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a5b31481d153dff1b7ecdd673624949b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a5b31481d153dff1b7ecdd673624949b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a5b31481d153dff1b7ecdd673624949b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "             spk_id  absolute_magnitude     diameter       albedo  \\\n",
+              "count  9.999000e+03         9999.000000  8688.000000  8672.000000   \n",
+              "mean   2.005000e+06           12.675380    19.245446     0.197723   \n",
+              "std    2.886607e+03            1.639609    30.190191     0.138819   \n",
+              "min    2.000001e+06            3.000000     0.300000     0.008000   \n",
+              "25%    2.002500e+06           11.900000     5.614000     0.074000   \n",
+              "50%    2.005000e+06           12.900000     9.814000     0.187000   \n",
+              "75%    2.007500e+06           13.700000    19.156750     0.283000   \n",
+              "max    2.009999e+06           20.700000   939.400000     1.000000   \n",
+              "\n",
+              "       diameter_sigma  eccentricity  inclination      moid_ld  \\\n",
+              "count     8591.000000   9999.000000  9999.000000  9999.000000   \n",
+              "mean         0.454072      0.148716     7.890742   509.805237   \n",
+              "std          1.093676      0.083803     6.336244   205.046582   \n",
+              "min          0.006000      0.001003     0.042716     0.131028   \n",
+              "25%          0.120000      0.093780     3.220137   377.829197   \n",
+              "50%          0.201000      0.140335     6.018836   470.650523   \n",
+              "75%          0.375000      0.187092    10.918176   636.010802   \n",
+              "max         39.297000      0.889831    68.018875  4241.524913   \n",
+              "\n",
+              "       semi_major_axis_au_unit  \n",
+              "count              9999.000000  \n",
+              "mean                  2.689836  \n",
+              "std                   0.607190  \n",
+              "min                   0.832048  \n",
+              "25%                   2.340816  \n",
+              "50%                   2.614468  \n",
+              "75%                   3.005449  \n",
+              "max                  24.667968  "
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-d16cf806-a3e2-46d9-973d-74448570aaa2\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>count</th>\n",
+              "      <td>9.999000e+03</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>8688.000000</td>\n",
+              "      <td>8672.000000</td>\n",
+              "      <td>8591.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>mean</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.675380</td>\n",
+              "      <td>19.245446</td>\n",
+              "      <td>0.197723</td>\n",
+              "      <td>0.454072</td>\n",
+              "      <td>0.148716</td>\n",
+              "      <td>7.890742</td>\n",
+              "      <td>509.805237</td>\n",
+              "      <td>2.689836</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>std</th>\n",
+              "      <td>2.886607e+03</td>\n",
+              "      <td>1.639609</td>\n",
+              "      <td>30.190191</td>\n",
+              "      <td>0.138819</td>\n",
+              "      <td>1.093676</td>\n",
+              "      <td>0.083803</td>\n",
+              "      <td>6.336244</td>\n",
+              "      <td>205.046582</td>\n",
+              "      <td>0.607190</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>min</th>\n",
+              "      <td>2.000001e+06</td>\n",
+              "      <td>3.000000</td>\n",
+              "      <td>0.300000</td>\n",
+              "      <td>0.008000</td>\n",
+              "      <td>0.006000</td>\n",
+              "      <td>0.001003</td>\n",
+              "      <td>0.042716</td>\n",
+              "      <td>0.131028</td>\n",
+              "      <td>0.832048</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>25%</th>\n",
+              "      <td>2.002500e+06</td>\n",
+              "      <td>11.900000</td>\n",
+              "      <td>5.614000</td>\n",
+              "      <td>0.074000</td>\n",
+              "      <td>0.120000</td>\n",
+              "      <td>0.093780</td>\n",
+              "      <td>3.220137</td>\n",
+              "      <td>377.829197</td>\n",
+              "      <td>2.340816</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>50%</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.900000</td>\n",
+              "      <td>9.814000</td>\n",
+              "      <td>0.187000</td>\n",
+              "      <td>0.201000</td>\n",
+              "      <td>0.140335</td>\n",
+              "      <td>6.018836</td>\n",
+              "      <td>470.650523</td>\n",
+              "      <td>2.614468</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>75%</th>\n",
+              "      <td>2.007500e+06</td>\n",
+              "      <td>13.700000</td>\n",
+              "      <td>19.156750</td>\n",
+              "      <td>0.283000</td>\n",
+              "      <td>0.375000</td>\n",
+              "      <td>0.187092</td>\n",
+              "      <td>10.918176</td>\n",
+              "      <td>636.010802</td>\n",
+              "      <td>3.005449</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>max</th>\n",
+              "      <td>2.009999e+06</td>\n",
+              "      <td>20.700000</td>\n",
+              "      <td>939.400000</td>\n",
+              "      <td>1.000000</td>\n",
+              "      <td>39.297000</td>\n",
+              "      <td>0.889831</td>\n",
+              "      <td>68.018875</td>\n",
+              "      <td>4241.524913</td>\n",
+              "      <td>24.667968</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-d16cf806-a3e2-46d9-973d-74448570aaa2')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-d16cf806-a3e2-46d9-973d-74448570aaa2 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-d16cf806-a3e2-46d9-973d-74448570aaa2');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 9
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check whether all the columns are needed for model training. Let's first look at the column descriptions provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the visual magnitude the object would have if it were located 1 au from both the Sun and the observer, at zero phase angle.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km.\n",
+        "* **albedo:** the fraction of incoming solar radiation that the object reflects diffusely, measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in the object diameter, in km.\n",
+        "* **eccentricity:** value between 0 and 1 that describes how much the asteroid's orbit deviates from a circle (0 is a perfect circle).\n",
+        "* **inclination:** angle of the orbit with respect to the x-y ecliptic plane.\n",
+        "* **moid_ld:** Earth Minimum Orbit Intersection Distance, in lunar distances (LD).\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** the length of half of the long axis of the orbit, in au.\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. These columns can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's have a look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_02cba1067de8e024374a26297834d233\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
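+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "The expression in the previous cell relies on `isnull()` returning booleans, whose column-wise mean is the fraction of missing entries. The pandas-only sketch below illustrates this on made-up data (`toy_df` and its values are hypothetical and not part of the dataset)."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "import numpy as np\n",
+        "import pandas as pd\n",
+        "\n",
+        "# Hypothetical data: one of the four 'diameter' values is missing.\n",
+        "toy_df = pd.DataFrame({\n",
+        "    'diameter': [939.4, np.nan, 5.6, 9.8],\n",
+        "    'eccentricity': [0.08, 0.14, 0.19, 0.23],\n",
+        "})\n",
+        "\n",
+        "# isnull() marks missing cells as True; the mean of those booleans is the\n",
+        "# fraction missing per column, so multiplying by 100 gives a percentage.\n",
+        "missing_pct = toy_df.isnull().mean() * 100\n",
+        "print(missing_pct)"
+      ]
+    },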
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most of the columns do not have any missing values. However, roughly 13% to 14% of the values in columns **'diameter'**, **'albedo'** and **'diameter_sigma'** are missing. Since the missing values cannot be measured or derived, we can remove these columns, as they will not be required for training the machine learning model."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "Next, we need to normalize the numerical columns before using them to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation (a.k.a. [z-score](https://developers.google.com/machine-learning/data-prep/transform/normalization#z-score)). This improves the model's performance and stability during training and inference.\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical columns and the categorical columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 356
+        },
+        "id": "Gjc0UlDD-xUn",
+        "outputId": "cadc4402-7edc-43f6-bc6a-bc4f1fd3314a"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "error",
+          "ename": "NotImplementedError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mNotImplementedError\u001b[0m                       Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/987840581.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Normalizing method_1: Can work but relies on ticket\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frame_base.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    426\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    427\u001b[0m     raise NotImplementedError(\n\u001b[0;32m--> 428\u001b[0;31m         \u001b[0;34mf\"{op!r} is not implemented yet. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    429\u001b[0m         \u001b[0;34mf\"If support for {op!r} is important to you, please let the Beam \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    430\u001b[0m         \u001b[0;34m\"community know by writing to user@beam.apache.org \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mNotImplementedError\u001b[0m: 'loc.setitem' is not implemented yet. If support for 'loc.setitem' is important to you, please let the Beam community know by writing to user@beam.apache.org (see https://beam.apache.org/community/contact-us/) or commenting on https://github.com/apache/beam/issues/20318"
+          ]
+        }
+      ],
+      "source": [
+        "# Normalizing method_1: Can work but relies on ticket #22267\n",
+        "beam_df.loc[:, numerical_cols] = (beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean()) / beam_df.loc[:, numerical_cols].std()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
+        "Normalizing the data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "be40308f-d27c-46fa-e365-43a635addf6b"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>306</th>\n",
+              "      <td>-1.570727</td>\n",
+              "      <td>-0.062543</td>\n",
+              "      <td>-0.278518</td>\n",
+              "      <td>0.373194</td>\n",
+              "      <td>0.357201</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>310</th>\n",
+              "      <td>-1.631718</td>\n",
+              "      <td>-1.724526</td>\n",
+              "      <td>-0.736389</td>\n",
+              "      <td>1.087833</td>\n",
+              "      <td>0.344233</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>546</th>\n",
+              "      <td>-1.753698</td>\n",
+              "      <td>1.028793</td>\n",
+              "      <td>1.415303</td>\n",
+              "      <td>-0.339489</td>\n",
+              "      <td>0.139080</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>635</th>\n",
+              "      <td>-1.875678</td>\n",
+              "      <td>0.244869</td>\n",
+              "      <td>0.005905</td>\n",
+              "      <td>0.214107</td>\n",
+              "      <td>0.367559</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>701</th>\n",
+              "      <td>-3.278451</td>\n",
+              "      <td>-1.570523</td>\n",
+              "      <td>2.006145</td>\n",
+              "      <td>1.542754</td>\n",
+              "      <td>0.829337</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9697</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-1.151809</td>\n",
+              "      <td>-0.082944</td>\n",
+              "      <td>-0.129556</td>\n",
+              "      <td>-0.533538</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9813</th>\n",
+              "      <td>1.722740</td>\n",
+              "      <td>0.844551</td>\n",
+              "      <td>-0.583247</td>\n",
+              "      <td>-1.006447</td>\n",
+              "      <td>-0.677961</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9868</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-0.207399</td>\n",
+              "      <td>-0.784665</td>\n",
+              "      <td>-0.462136</td>\n",
+              "      <td>-0.539794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9903</th>\n",
+              "      <td>0.868878</td>\n",
+              "      <td>0.460086</td>\n",
+              "      <td>0.092258</td>\n",
+              "      <td>-0.107597</td>\n",
+              "      <td>0.071794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9956</th>\n",
+              "      <td>0.746898</td>\n",
+              "      <td>-0.234132</td>\n",
+              "      <td>-0.161116</td>\n",
+              "      <td>-0.601379</td>\n",
+              "      <td>-0.664887</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 5 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_numerical_cols = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize the numerical columns only\n",
+        "beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean())/beam_numerical_cols.std()\n",
+        "\n",
+        "ib.collect(beam_numerical_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to convert the categorical columns into one-hot encoded variables so that we can use them during training.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "Ngoxg0rSywVd",
+        "outputId": "d81bb29a-f8f8-4186-b186-5ff85667dbec"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1671644751.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'object_class'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'get_dummies'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col= beam_df.filter(items=['object_class'])\n",
+        "object_class_col.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "hz8s7z8caTq-",
+        "outputId": "5c543a6b-0ea1-41f8-afec-9691bbbd1f5b"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1927971370.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;31m# df['categories_concat'].str.get_dummies('-')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'str'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col.str.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "rVdSIyCB0spw"
+      },
+      "source": [
+        "# Putting it all together\n",
+        "\n",
+        "Let's now try to summarize all the steps that we've executed above into a full pipeline implementation and visualize our pre-processed data.\n",
+        "\n",
+        "> ℹ️ Note that the only standard Beam construct invoked here is the `Pipeline` instance. The rest of the pre-processing commands are native pandas methods that the Beam DataFrame API supports."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "ndaSNond0v8Q",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 498
+        },
+        "outputId": "0155d359-45c9-4345-e1b6-b1881408f049"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "error",
+          "ename": "ValueError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1408061827.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m     25\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     26\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 27\u001b[0;31m \u001b[0mib\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcollect\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpreprocessed_dataset\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/utils.py\u001b[0m in \u001b[0;36mrun_within_progress_indicator\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    275\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mrun_within_progress_indicator\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    276\u001b[0m     \u001b[0;32mwith\u001b[0m \u001b[0mProgressIndicator\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34mf'Processing... {func.__name__}'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Done.'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 277\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001
 b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    278\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    279\u001b[0m   \u001b[0;32mreturn\u001b[0m \u001b[0mrun_within_progress_indicator\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/interactive_beam.py\u001b[0m in \u001b[0;36mcollect\u001b[0;34m(pcoll, n, duration, include_window_info)\u001b[0m\n\u001b[1;32m    945\u001b[0m         element_type=element_type)\n\u001b[1;32m    946\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 947\u001b[0;31m   \u001b[0mrecording\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mrecording_manager\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrecord\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mpcoll\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmax_n\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mn\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmax_duration\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mduration\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    948\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    949\u001b[0m   \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0
 ;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/recording_manager.py\u001b[0m in \u001b[0;36mrecord\u001b[0;34m(self, pcolls, max_n, max_duration)\u001b[0m\n\u001b[1;32m    459\u001b[0m       pf.PipelineFragment(\n\u001b[1;32m    460\u001b[0m           \u001b[0mlist\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0muncomputed_pcolls\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 461\u001b[0;31m           self.user_pipeline.options).run(blocking=is_remote_run)\n\u001b[0m\u001b[1;32m    462\u001b[0m       \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mie\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcurrent_env\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpipeline_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0muser_pipeline\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n
 \u001b[1;32m    463\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py\u001b[0m in \u001b[0;36mrun\u001b[0;34m(self, display_pipeline_graph, use_cache, blocking)\u001b[0m\n\u001b[1;32m    111\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_force_compute\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0muse_cache\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    112\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_blocking\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mblocking\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 113\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdeduce_fragment\u001
 b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    114\u001b[0m     \u001b[0;32mfinally\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    115\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_skip_display\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpreserved_skip_display\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py\u001b[0m in \u001b[0;36mdeduce_fragment\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m     98\u001b[0m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mto_runner_api\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     99\u001b[0m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 100\u001b[0;31m         self._options)\n\u001b[0m\u001b[1;32m    101\u001b[0m     \u001b[0mie\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcurrent_env\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0madd_derived_pipeline\u001b[0m\u001b[0;34m(\u001b[0m\u001b[
 0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfragment\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    102\u001b[0m     \u001b[0;32mreturn\u001b[0m \u001b[0mfragment\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/pipeline.py\u001b[0m in \u001b[0;36mfrom_runner_api\u001b[0;34m(proto, runner, options, return_context)\u001b[0m\n\u001b[1;32m    990\u001b[0m       \u001b[0mpcollection\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpipeline\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mp\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    991\u001b[0m       \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mpcollection\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mproducer\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 992\u001b[0;31m         \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'No producer for %s'\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0mid\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    993\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    994\u001b[0m 
     \u001b[0;31m# Inject PBegin input where necessary.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mValueError\u001b[0m: No producer for ref_PCollection_PCollection_265"
+          ]
+        }
+      ],
+      "source": [
+        "# Specify the location of source csv file to be processed\n",
+        "source_csv_file = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(source_csv_file, splittable=True)\n",
+        "\n",
+        "# Drop irrelevant columns/columns with missing values\n",
+        "beam_df = beam_df.drop(['spk_id', 'full_name','diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "\n",
+        "# Get numerical columns/columns with categorical variables\n",
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))\n",
+        "\n",
+        "# Normalize the numerical variables \n",
+        "beam_df_numerical = beam_df.filter(items=numerical_cols)\n",
+        "beam_df_numerical = (beam_df_numerical - beam_df_numerical.mean())/beam_df_numerical.std()\n",
+        "\n",
+        "\n",
+        "# One-hot encode the categorical variables \n",
+        "beam_df_categorical = beam_df.filter(items=categorical_cols)\n",
+        "# ToDo: one hot-encoding step\n",
+        "\n",
+        "# Merge the normalized variables with the one-hot encoded variables\n",
+        "preprocessed_dataset = beam_df_categorical.merge(beam_df_numerical, left_index = True, right_index = True)\n",
+        "\n",
+        "ib.collect(preprocessed_dataset)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xZvJTqa3XKI_"
+      },
+      "source": [
+        "# Part II : Process the full dataset with the Distributed Runner\n",
+        "Now that we've showcased how to build and execute the pipeline locally using the Interactive Runner, it's time to execute our pipeline on our full dataset by switching to a distributed runner. For this example, we will execute our pipeline on [Dataflow](https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline)."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "PROJECT_ID = \"<my-gcp-project>\"\n",
+        "REGION = \"us-west1\"\n",
+        "TEMP_DIR = \"gs://<my-bucket>/tmp\"\n",
+        "OUTPUT_DIR = \"gs://<my-bucket>/dataframe-result\""
+      ],
+      "metadata": {
+        "id": "dDBYbMEWbL4t"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "> ℹ️ Note that we are now processing the full dataset `full.csv`, which contains approximately 1 million rows. We're also writing the results to a `csv` file instead of using `ib.collect()` to materialize the deferred dataframe.\n",
+        "\n",
+        "> ℹ️ The only things we need to change to switch from an interactive runner to a distributed one are the pipeline options. The rest of the pipeline steps are identical."
+      ],
+      "metadata": {
+        "id": "Qk1GaYoSc9-1"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Specify the location of source csv file to be processed (full dataset)\n",
+        "source_csv_file = 'gs://apache-beam-samples/nasa_jpl_asteroid/full.csv'\n",
+        "\n",
+        "# Build a new pipeline that will execute on Dataflow.\n",
+        "p = beam.Pipeline(DataflowRunner(),\n",
+        "                  options=beam.options.pipeline_options.PipelineOptions(\n",
+        "                      project=PROJECT_ID,\n",
+        "                      region=REGION,\n",
+        "                      temp_location=TEMP_DIR,\n",
+        "                      # Disable autoscaling for a quicker demo\n",
+        "                      autoscaling_algorithm='NONE',\n",
+        "                      num_workers=10))\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(source_csv_file, splittable=True)\n",
+        "\n",
+        "# Drop irrelevant columns/columns with missing values\n",
+        "beam_df = beam_df.drop(['spk_id', 'full_name','diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "\n",
+        "# Get numerical columns/columns with categorical variables\n",
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))\n",
+        "\n",
+        "# Normalize the numerical variables \n",
+        "beam_df_numerical = beam_df.filter(items=numerical_cols)\n",
+        "beam_df_numerical = (beam_df_numerical - beam_df_numerical.mean())/beam_df_numerical.std()\n",
+        "\n",
+        "# One-hot encode the categorical variables \n",
+        "beam_df_categorical = beam_df.filter(items=categorical_cols)\n",
+        "# Todo: one hot-encoding step\n",
+        "\n",
+        "# Merge the normalized variables with the one-hot encoded variables (Optional)\n",
+        "preprocessed_dataset = beam_df_categorical.merge(beam_df_numerical, left_index = True, right_index = True)\n",
+        "\n",
+        "# Write the pre-processed dataset to csv\n",
+        "preprocessed_dataset.to_csv(os.path.join(OUTPUT_DIR, \"preprocessed_data.csv\"))"
+      ],
+      "metadata": {
+        "id": "1XovR0gKbMlK"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Let's now submit and execute our pipeline."
+      ],
+      "metadata": {
+        "id": "a789u4Yecs_g"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "p.run().wait_until_finish()"
+      ],
+      "metadata": {
+        "id": "pbUlC102bPaZ"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "The pipeline job will take some time to finish."
+      ],
+      "metadata": {
+        "id": "dzdqmzKzTOng"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# What's next \n",
+        "\n",
+        "Now that we've seen how to analyze and preprocess a large-scale dataset with the Beam DataFrames API, we can train a model for a classification task on our preprocessed dataset.\n",
+        "\n",
+        "To learn more on how to get started with classifying structured data, refer to:\n",
+        "\n",
+        "*   [Classify structured data with feature columns](https://www.tensorflow.org/tutorials/structured_data/feature_columns)\n",

Review Comment:
   good to know, i'll remove the resource
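
For reference, the two preprocessing steps the notebook performs (standardizing the numerical columns and one-hot encoding the categorical ones) compute the following. This is a minimal plain-Python sketch, not the Beam DataFrame API: the helper names `standardize` and `one_hot` and the category labels are hypothetical and chosen only for illustration. The one-hot helper assumes the category list is known up front, since the set of output columns otherwise depends on the data.

```python
import statistics


def standardize(values):
    """Return z-scores: (value - mean) / sample standard deviation.

    Uses the sample standard deviation (n - 1 denominator), which is
    also the default of pandas' std().
    """
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


def one_hot(values, categories):
    """Encode each value as 0/1 indicators over a fixed category list.

    `categories` is assumed to be known in advance (hypothetical labels
    in the example below), so the output columns do not depend on the data.
    """
    return [{c: int(v == c) for c in categories} for v in values]


# Illustrative inputs only; these are not rows from the notebook's dataset.
z_scores = standardize([1.0, 2.0, 3.0])
encoded = one_hot(["MBA", "AMO", "MBA"], categories=["AMO", "MBA"])
```

Fixing the categories ahead of time is the design choice worth noting here: it keeps the shape of the encoded output independent of which values happen to appear in a given sample.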





[GitHub] [beam] PhilippeMoussalli commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
PhilippeMoussalli commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r948163222


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. This API provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by only invoking standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install the latest released version.\n",
+        "\n",
+        "**[12/07/2022]:** `df.mean()` is not yet supported in the current release (Beam 2.40).\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to quickly test our pipeline locally before running it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we only work with a subset of the original dataset because we are limited to the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function, which emulates `pandas.read_csv`. The main difference between them is that the Beam function returns a deferred Beam DataFrame, while pandas returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, sample] where\n",
+        "# sample contains all of the dataset (around 1000000 samples)\n",
+        "file_location = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize the pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(file_location, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. A distance of 70,000 km may not seem potentially harmful, but at an astronomical scale it is very small and can disrupt many natural phenomena. \n",
+        "\n",
+        "These objects, such as asteroids, can therefore prove harmful, so it is wise to know what surrounds us and which of these objects could harm us. This dataset compiles the list of NASA-certified asteroids that are classified as the nearest Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types:"
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "c24ff83d-3a13-47a6-c9c2-3978729fde82"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 4
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 746
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "14fa80de-2dee-4963-99d8-3e321f949ff8"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "       spk_id                   full_name near_earth_object  \\\n",
+              "0     2000001                     1 Ceres                 N   \n",
+              "1     2000002                    2 Pallas                 N   \n",
+              "2     2000003                      3 Juno                 N   \n",
+              "3     2000004                     4 Vesta                 N   \n",
+              "4     2000005                   5 Astraea                 N   \n",
+              "...       ...                         ...               ...   \n",
+              "9994  2009995    9995 Alouette (4805 P-L)                 N   \n",
+              "9995  2009996         9996 ANS (9070 P-L)                 N   \n",
+              "9996  2009997        9997 COBE (1217 T-1)                 N   \n",
+              "9997  2009998         9998 ISO (1293 T-1)                 N   \n",
+              "9998  2009999       9999 Wiles (4196 T-2)                 N   \n",
+              "\n",
+              "      absolute_magnitude  diameter  albedo  diameter_sigma  eccentricity  \\\n",
+              "0                   3.40   939.400  0.0900           0.200      0.076009   \n",
+              "1                   4.20   545.000  0.1010          18.000      0.229972   \n",
+              "2                   5.33   246.596  0.2140          10.594      0.256936   \n",
+              "3                   3.00   525.400  0.4228           0.200      0.088721   \n",
+              "4                   6.90   106.699  0.2740           3.140      0.190913   \n",
+              "...                  ...       ...     ...             ...           ...   \n",
+              "9994               15.10     2.564  0.2450           0.550      0.160610   \n",
+              "9995               13.60     8.978  0.1130           0.376      0.235174   \n",
+              "9996               14.30       NaN     NaN             NaN      0.113059   \n",
+              "9997               15.10     2.235  0.3880           0.373      0.093852   \n",
+              "9998               13.00     7.148  0.2620           0.065      0.071351   \n",
+              "\n",
+              "      inclination     moid_ld object_class  semi_major_axis_au_unit  \\\n",
+              "0       10.594067  620.640533          MBA                 2.769165   \n",
+              "1       34.832932  480.348639          MBA                 2.773841   \n",
+              "2       12.991043  402.514639          MBA                 2.668285   \n",
+              "3        7.141771  443.451432          MBA                 2.361418   \n",
+              "4        5.367427  426.433027          MBA                 2.574037   \n",
+              "...           ...         ...          ...                      ...   \n",
+              "9994     2.311731  388.723233          MBA                 2.390249   \n",
+              "9995     7.657713  444.194746          MBA                 2.796605   \n",
+              "9996     2.459643  495.460110          MBA                 2.545674   \n",
+              "9997     3.912263  373.848377          MBA                 2.160961   \n",
+              "9998     3.198839  632.144398          MBA                 2.839917   \n",
+              "\n",
+              "     hazardous_flag  \n",
+              "0                 N  \n",
+              "1                 N  \n",
+              "2                 N  \n",
+              "3                 N  \n",
+              "4                 N  \n",
+              "...             ...  \n",
+              "9994              N  \n",
+              "9995              N  \n",
+              "9996              N  \n",
+              "9997              N  \n",
+              "9998              N  \n",
+              "\n",
+              "[9999 rows x 13 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-0161aa69-d50f-4d6f-84c1-10dacb278880\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>full_name</th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>2000001</td>\n",
+              "      <td>1 Ceres</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>939.400</td>\n",
+              "      <td>0.0900</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>2000002</td>\n",
+              "      <td>2 Pallas</td>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>545.000</td>\n",
+              "      <td>0.1010</td>\n",
+              "      <td>18.000</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>2000003</td>\n",
+              "      <td>3 Juno</td>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>246.596</td>\n",
+              "      <td>0.2140</td>\n",
+              "      <td>10.594</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>2000004</td>\n",
+              "      <td>4 Vesta</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>525.400</td>\n",
+              "      <td>0.4228</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>2000005</td>\n",
+              "      <td>5 Astraea</td>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>106.699</td>\n",
+              "      <td>0.2740</td>\n",
+              "      <td>3.140</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>2009995</td>\n",
+              "      <td>9995 Alouette (4805 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.564</td>\n",
+              "      <td>0.2450</td>\n",
+              "      <td>0.550</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>2009996</td>\n",
+              "      <td>9996 ANS (9070 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>8.978</td>\n",
+              "      <td>0.1130</td>\n",
+              "      <td>0.376</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>2009997</td>\n",
+              "      <td>9997 COBE (1217 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>2009998</td>\n",
+              "      <td>9998 ISO (1293 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.235</td>\n",
+              "      <td>0.3880</td>\n",
+              "      <td>0.373</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>2009999</td>\n",
+              "      <td>9999 Wiles (4196 T-2)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>7.148</td>\n",
+              "      <td>0.2620</td>\n",
+              "      <td>0.065</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 13 columns</p>\n",
+              "</div>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 5
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check whether all the columns can be used for model training. Let's first have a look at the column descriptions as provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the absolute magnitude (H) of the object, that is, the visual magnitude it would have if observed at a distance of 1 au from both the Sun and the observer, at zero phase angle.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km.\n",
+        "* **albedo:** a measure of the diffuse reflection of solar radiation out of the total solar radiation, measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in the object diameter, in km.\n",
+        "* **eccentricity:** a value between 0 and 1 that describes how much the asteroid's orbit deviates from a perfect circle.\n",
+        "* **inclination:** angle of the orbit with respect to the x-y ecliptic plane.\n",
+        "* **moid_ld:** Earth minimum orbit intersection distance, in lunar distance (LD) units.\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** the length of half of the long axis of the orbit, in au.\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. These columns can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's have a look at the percentage of missing values in each column:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most columns have no missing values. Columns **'diameter'**, **'albedo'** and **'diameter_sigma'** are each missing roughly 13-14% of their values. Since these values cannot be measured or derived, we can remove these columns; they will not be required for machine learning model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "The numerical columns need to be normalized before using them to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation. This puts all numerical columns on the same scale so that they are weighted equally during training."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical columns and the categorical columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 356
+        },
+        "id": "Gjc0UlDD-xUn",
+        "outputId": "cadc4402-7edc-43f6-bc6a-bc4f1fd3314a"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "error",
+          "ename": "NotImplementedError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mNotImplementedError\u001b[0m                       Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/987840581.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Normalizing method_1: Can work but relies on ticket\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frame_base.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    426\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    427\u001b[0m     raise NotImplementedError(\n\u001b[0;32m--> 428\u001b[0;31m         \u001b[0;34mf\"{op!r} is not implemented yet. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    429\u001b[0m         \u001b[0;34mf\"If support for {op!r} is important to you, please let the Beam \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    430\u001b[0m         \u001b[0;34m\"community know by writing to user@beam.apache.org \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\n",
+            "\u001b[0;31mNotImplementedError\u001b[0m: 'loc.setitem' is not implemented yet. If support for 'loc.setitem' is important to you, please let the Beam community know by writing to user@beam.apache.org (see https://beam.apache.org/community/contact-us/) or commenting on https://github.com/apache/beam/issues/20318"
+          ]
+        }
+      ],
+      "source": [
+        "# Normalizing method_1: Can work but relies on ticket #22267\n",
+        "beam_df.loc[:, numerical_cols] = (beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean()) / beam_df.loc[:, numerical_cols].std()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
+        "Normalizing the data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "be40308f-d27c-46fa-e365-43a635addf6b"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>306</th>\n",
+              "      <td>-1.570727</td>\n",
+              "      <td>-0.062543</td>\n",
+              "      <td>-0.278518</td>\n",
+              "      <td>0.373194</td>\n",
+              "      <td>0.357201</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>310</th>\n",
+              "      <td>-1.631718</td>\n",
+              "      <td>-1.724526</td>\n",
+              "      <td>-0.736389</td>\n",
+              "      <td>1.087833</td>\n",
+              "      <td>0.344233</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>546</th>\n",
+              "      <td>-1.753698</td>\n",
+              "      <td>1.028793</td>\n",
+              "      <td>1.415303</td>\n",
+              "      <td>-0.339489</td>\n",
+              "      <td>0.139080</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>635</th>\n",
+              "      <td>-1.875678</td>\n",
+              "      <td>0.244869</td>\n",
+              "      <td>0.005905</td>\n",
+              "      <td>0.214107</td>\n",
+              "      <td>0.367559</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>701</th>\n",
+              "      <td>-3.278451</td>\n",
+              "      <td>-1.570523</td>\n",
+              "      <td>2.006145</td>\n",
+              "      <td>1.542754</td>\n",
+              "      <td>0.829337</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9697</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-1.151809</td>\n",
+              "      <td>-0.082944</td>\n",
+              "      <td>-0.129556</td>\n",
+              "      <td>-0.533538</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9813</th>\n",
+              "      <td>1.722740</td>\n",
+              "      <td>0.844551</td>\n",
+              "      <td>-0.583247</td>\n",
+              "      <td>-1.006447</td>\n",
+              "      <td>-0.677961</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9868</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-0.207399</td>\n",
+              "      <td>-0.784665</td>\n",
+              "      <td>-0.462136</td>\n",
+              "      <td>-0.539794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9903</th>\n",
+              "      <td>0.868878</td>\n",
+              "      <td>0.460086</td>\n",
+              "      <td>0.092258</td>\n",
+              "      <td>-0.107597</td>\n",
+              "      <td>0.071794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9956</th>\n",
+              "      <td>0.746898</td>\n",
+              "      <td>-0.234132</td>\n",
+              "      <td>-0.161116</td>\n",
+              "      <td>-0.601379</td>\n",
+              "      <td>-0.664887</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 5 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_numerical_cols = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize the numerical columns (zero mean, unit variance)\n",
+        "beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean())/beam_numerical_cols.std()\n",
+        "\n",
+        "ib.collect(beam_numerical_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to convert the categorical columns into one-hot encoded variables so that they can be used during training.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "Ngoxg0rSywVd",
+        "outputId": "d81bb29a-f8f8-4186-b186-5ff85667dbec"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1671644751.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'object_class'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'get_dummies'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col= beam_df.filter(items=['object_class'])\n",
+        "object_class_col.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "hz8s7z8caTq-",
+        "outputId": "5c543a6b-0ea1-41f8-afec-9691bbbd1f5b"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1927971370.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;31m# df['categories_concat'].str.get_dummies('-')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'str'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col.str.get_dummies()"

Review Comment:
   I tried this approach, and I am still not able to run the `unique()` command, even when passing the argument as `unique(as_series=True)` as specified [here](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.frames.html#:~:text=b%0A2%20%20%20%20c-,unique,-(as_series%3DFalse))
   
   **Error**: `unique()` is not implemented for deferred Dataframes
   
   In any case, I think it would be nicer to wait for `get_dummies()`, as it will be more intuitive for users. I see that you already filed a [ticket](https://github.com/apache/beam/issues/22646) for it and referred to it in the deliverable 3 doc.
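
For context, one-hot encoding is presumably simplest to defer when the set of categories is declared before the pipeline runs, because the output columns then do not depend on the data. The sketch below illustrates that idea in plain Python. It is not the Beam DataFrame API: the function name `one_hot_encode` and the categories `APO` and `AMO` are hypothetical (only `MBA` appears in the notebook output).

```python
# Illustrative sketch only: plain Python, not the Beam DataFrame API.
from typing import Dict, List, Sequence


def one_hot_encode(values: Sequence[str],
                   categories: Sequence[str]) -> List[Dict[str, int]]:
    """One-hot encode `values` against a fixed, pre-declared category list."""
    rows = []
    for value in values:
        # Every row gets the same columns, whichever values occur in the data.
        rows.append({f"object_class_{c}": int(value == c) for c in categories})
    return rows


# 'MBA' comes from the notebook output; 'APO' and 'AMO' are assumed extras.
categories = ["MBA", "APO", "AMO"]
encoded = one_hot_encode(["MBA", "APO", "MBA"], categories)
print(encoded[0])
```

Because the column names come from `categories` rather than from the data, the shape of the result is known up front, which is the property a deferred implementation would need.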
   



##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. This API provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by only invoking standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to quickly test our pipeline locally before running it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we will only be working with a subset of the original dataset since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam method returns a deferred Beam DataFrame while the pandas function returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, sample] where\n",
+        "# sample contains all of the dataset (around 1000000 samples)\n",
+        "file_location = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(file_location, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. A distance of 70,000 km might not seem potentially harmful, but at an astronomical scale it is very small and can disrupt many natural phenomena. \n",
+        "\n",
+        "These objects (asteroids) can therefore prove harmful, so it is wise to know what surrounds us and which of these objects can harm us. This dataset compiles the list of NASA-certified asteroids that are classified as the nearest Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types."
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "c24ff83d-3a13-47a6-c9c2-3978729fde82"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 4
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a Pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 746
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "14fa80de-2dee-4963-99d8-3e321f949ff8"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a986c6cc61ed5a5b622e163d92f73775\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "       spk_id                   full_name near_earth_object  \\\n",
+              "0     2000001                     1 Ceres                 N   \n",
+              "1     2000002                    2 Pallas                 N   \n",
+              "2     2000003                      3 Juno                 N   \n",
+              "3     2000004                     4 Vesta                 N   \n",
+              "4     2000005                   5 Astraea                 N   \n",
+              "...       ...                         ...               ...   \n",
+              "9994  2009995    9995 Alouette (4805 P-L)                 N   \n",
+              "9995  2009996         9996 ANS (9070 P-L)                 N   \n",
+              "9996  2009997        9997 COBE (1217 T-1)                 N   \n",
+              "9997  2009998         9998 ISO (1293 T-1)                 N   \n",
+              "9998  2009999       9999 Wiles (4196 T-2)                 N   \n",
+              "\n",
+              "      absolute_magnitude  diameter  albedo  diameter_sigma  eccentricity  \\\n",
+              "0                   3.40   939.400  0.0900           0.200      0.076009   \n",
+              "1                   4.20   545.000  0.1010          18.000      0.229972   \n",
+              "2                   5.33   246.596  0.2140          10.594      0.256936   \n",
+              "3                   3.00   525.400  0.4228           0.200      0.088721   \n",
+              "4                   6.90   106.699  0.2740           3.140      0.190913   \n",
+              "...                  ...       ...     ...             ...           ...   \n",
+              "9994               15.10     2.564  0.2450           0.550      0.160610   \n",
+              "9995               13.60     8.978  0.1130           0.376      0.235174   \n",
+              "9996               14.30       NaN     NaN             NaN      0.113059   \n",
+              "9997               15.10     2.235  0.3880           0.373      0.093852   \n",
+              "9998               13.00     7.148  0.2620           0.065      0.071351   \n",
+              "\n",
+              "      inclination     moid_ld object_class  semi_major_axis_au_unit  \\\n",
+              "0       10.594067  620.640533          MBA                 2.769165   \n",
+              "1       34.832932  480.348639          MBA                 2.773841   \n",
+              "2       12.991043  402.514639          MBA                 2.668285   \n",
+              "3        7.141771  443.451432          MBA                 2.361418   \n",
+              "4        5.367427  426.433027          MBA                 2.574037   \n",
+              "...           ...         ...          ...                      ...   \n",
+              "9994     2.311731  388.723233          MBA                 2.390249   \n",
+              "9995     7.657713  444.194746          MBA                 2.796605   \n",
+              "9996     2.459643  495.460110          MBA                 2.545674   \n",
+              "9997     3.912263  373.848377          MBA                 2.160961   \n",
+              "9998     3.198839  632.144398          MBA                 2.839917   \n",
+              "\n",
+              "     hazardous_flag  \n",
+              "0                 N  \n",
+              "1                 N  \n",
+              "2                 N  \n",
+              "3                 N  \n",
+              "4                 N  \n",
+              "...             ...  \n",
+              "9994              N  \n",
+              "9995              N  \n",
+              "9996              N  \n",
+              "9997              N  \n",
+              "9998              N  \n",
+              "\n",
+              "[9999 rows x 13 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-0161aa69-d50f-4d6f-84c1-10dacb278880\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>full_name</th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>2000001</td>\n",
+              "      <td>1 Ceres</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>939.400</td>\n",
+              "      <td>0.0900</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>2000002</td>\n",
+              "      <td>2 Pallas</td>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>545.000</td>\n",
+              "      <td>0.1010</td>\n",
+              "      <td>18.000</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>2000003</td>\n",
+              "      <td>3 Juno</td>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>246.596</td>\n",
+              "      <td>0.2140</td>\n",
+              "      <td>10.594</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>2000004</td>\n",
+              "      <td>4 Vesta</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>525.400</td>\n",
+              "      <td>0.4228</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>2000005</td>\n",
+              "      <td>5 Astraea</td>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>106.699</td>\n",
+              "      <td>0.2740</td>\n",
+              "      <td>3.140</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>2009995</td>\n",
+              "      <td>9995 Alouette (4805 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.564</td>\n",
+              "      <td>0.2450</td>\n",
+              "      <td>0.550</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>2009996</td>\n",
+              "      <td>9996 ANS (9070 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>8.978</td>\n",
+              "      <td>0.1130</td>\n",
+              "      <td>0.376</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>2009997</td>\n",
+              "      <td>9997 COBE (1217 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>2009998</td>\n",
+              "      <td>9998 ISO (1293 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.235</td>\n",
+              "      <td>0.3880</td>\n",
+              "      <td>0.373</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>2009999</td>\n",
+              "      <td>9999 Wiles (4196 T-2)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>7.148</td>\n",
+              "      <td>0.2620</td>\n",
+              "      <td>0.065</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 13 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-0161aa69-d50f-4d6f-84c1-10dacb278880')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 5
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check whether all the columns can be used for model training. Let's first look at the column descriptions provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the visual magnitude an asteroid would have if it were located 1 au from both the Sun and the observer, at zero phase angle.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km\n",
+        "* **albedo:** the fraction of incoming solar radiation that the object diffusely reflects, measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in object diameter, in km.\n",
+        "* **eccentricity:** value between 0 and 1 that describes how elongated or circular the asteroid's orbit is\n",
+        "* **inclination:** angle of the orbit with respect to the x-y ecliptic plane\n",
+        "* **moid_ld:** Earth Minimum Orbit Intersection Distance, in lunar distance (LD) units\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** the length of half of the long axis of the orbit, in au\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. These columns can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's have a look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_02cba1067de8e024374a26297834d233\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most columns have no missing values, but **'diameter'**, **'albedo'** and **'diameter_sigma'** are each missing in roughly 13-14% of rows. Since these values cannot be measured or derived, we can remove these columns; they will not be required for training the machine learning model."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "The numerical columns need to be normalized before using them to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation. This puts all the columns on the same scale so that they are weighted equally during training."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical columns and the categorical columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 356
+        },
+        "id": "Gjc0UlDD-xUn",
+        "outputId": "cadc4402-7edc-43f6-bc6a-bc4f1fd3314a"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "error",
+          "ename": "NotImplementedError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mNotImplementedError\u001b[0m                       Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/987840581.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Normalizing method_1: Can work but relies on ticket\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001
 b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frame_base.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    426\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    427\u001b[0m     raise NotImplementedError(\n\u001b[0;32m--> 428\u001b[0;31m         \u001b[0;34mf\"{op!r} is not implemented yet. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    429\u001b[0m         \u001b[0;34mf\"If support for {op!r} is important to you, please let the Beam \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    430\u001b[0m         \u001b[0;34m\"community know by writing to user@beam.apache.org \"\u001b[0m\u001b[0;34m\u
 001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mNotImplementedError\u001b[0m: 'loc.setitem' is not implemented yet. If support for 'loc.setitem' is important to you, please let the Beam community know by writing to user@beam.apache.org (see https://beam.apache.org/community/contact-us/) or commenting on https://github.com/apache/beam/issues/20318"
+          ]
+        }
+      ],
+      "source": [
+        "# Normalizing method_1: Can work but relies on ticket #22267\n",
+        "beam_df.loc[:,numerical_cols] = (beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean())/beam_df.loc[:, numerical_cols].std()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
+        "Normalizing the data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "be40308f-d27c-46fa-e365-43a635addf6b"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>306</th>\n",
+              "      <td>-1.570727</td>\n",
+              "      <td>-0.062543</td>\n",
+              "      <td>-0.278518</td>\n",
+              "      <td>0.373194</td>\n",
+              "      <td>0.357201</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>310</th>\n",
+              "      <td>-1.631718</td>\n",
+              "      <td>-1.724526</td>\n",
+              "      <td>-0.736389</td>\n",
+              "      <td>1.087833</td>\n",
+              "      <td>0.344233</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>546</th>\n",
+              "      <td>-1.753698</td>\n",
+              "      <td>1.028793</td>\n",
+              "      <td>1.415303</td>\n",
+              "      <td>-0.339489</td>\n",
+              "      <td>0.139080</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>635</th>\n",
+              "      <td>-1.875678</td>\n",
+              "      <td>0.244869</td>\n",
+              "      <td>0.005905</td>\n",
+              "      <td>0.214107</td>\n",
+              "      <td>0.367559</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>701</th>\n",
+              "      <td>-3.278451</td>\n",
+              "      <td>-1.570523</td>\n",
+              "      <td>2.006145</td>\n",
+              "      <td>1.542754</td>\n",
+              "      <td>0.829337</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9697</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-1.151809</td>\n",
+              "      <td>-0.082944</td>\n",
+              "      <td>-0.129556</td>\n",
+              "      <td>-0.533538</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9813</th>\n",
+              "      <td>1.722740</td>\n",
+              "      <td>0.844551</td>\n",
+              "      <td>-0.583247</td>\n",
+              "      <td>-1.006447</td>\n",
+              "      <td>-0.677961</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9868</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-0.207399</td>\n",
+              "      <td>-0.784665</td>\n",
+              "      <td>-0.462136</td>\n",
+              "      <td>-0.539794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9903</th>\n",
+              "      <td>0.868878</td>\n",
+              "      <td>0.460086</td>\n",
+              "      <td>0.092258</td>\n",
+              "      <td>-0.107597</td>\n",
+              "      <td>0.071794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9956</th>\n",
+              "      <td>0.746898</td>\n",
+              "      <td>-0.234132</td>\n",
+              "      <td>-0.161116</td>\n",
+              "      <td>-0.601379</td>\n",
+              "      <td>-0.664887</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 5 columns</p>\n",
+              "</div>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_numerical_cols = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize the DataFrame that contains only the numerical columns\n",
+        "beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean())/beam_numerical_cols.std()\n",
+        "\n",
+        "ib.collect(beam_numerical_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to convert the categorical columns into one-hot encoded variables to use them during training.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "Ngoxg0rSywVd",
+        "outputId": "d81bb29a-f8f8-4186-b186-5ff85667dbec"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1671644751.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'object_class'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'get_dummies'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col= beam_df.filter(items=['object_class'])\n",
+        "object_class_col.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "hz8s7z8caTq-",
+        "outputId": "5c543a6b-0ea1-41f8-afec-9691bbbd1f5b"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1927971370.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;31m# df['categories_concat'].str.get_dummies('-')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'str'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col.str.get_dummies()"

Review Comment:
   I tried this approach and I am still not able to run the `unique()` command, even when passing the argument as `unique(as_series=True)`, as specified [here](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.frames.html#:~:text=b%0A2%20%20%20%20c-,unique,-(as_series%3DFalse))
   
   **Error**: `unique() is not implemented for deferred Dataframes`
   
   In any case, I think it would be nicer to wait for `get_dummies()`, as it will be more intuitive for users. I see that you already filed a [ticket](https://github.com/apache/beam/issues/22646) for it and referred to it in the deliverable 3 doc.
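
For reference, a minimal sketch of what one-hot encoding looks like when the categories are known up front, so that no `unique()` call is needed. This is plain pandas and has not been checked against deferred Beam DataFrames; the helper name `one_hot_known_categories` and the sample `object_class` values are hypothetical and only serve the illustration.

```python
import pandas as pd


def one_hot_known_categories(series, categories):
    """Build one 0/1 column per category using only elementwise comparisons."""
    return pd.DataFrame(
        {f"{series.name}_{cat}": (series == cat).astype(int) for cat in categories},
        index=series.index,
    )


# Illustrative data; the category values are made up for this example.
df = pd.DataFrame({"object_class": ["APO", "AMO", "APO", "MBA"]})
encoded = one_hot_known_categories(df["object_class"], ["AMO", "APO", "MBA"])
print(encoded)
```

The trade-off is that the category list has to be supplied by hand, which is exactly the inconvenience that `get_dummies()` would remove.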
   





[GitHub] [beam] PhilippeMoussalli commented on pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
PhilippeMoussalli commented on PR #22587:
URL: https://github.com/apache/beam/pull/22587#issuecomment-1218223350

   > > > I think the "No producer" error is a bug with PCollection pruning in interactive beam: #21430
   > > > CC: @KevinGG
   > > 
   > > 
   > > Commented in #21430, we can disable pruning for dataframe like what we did for TestStream: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py#L219
   > 
   > Note if we have to do that to unblock this change, it will be blocked until 2.42.0 is out.
   
   Would it be easier to execute the workaround with `loc.setitem`? https://github.com/apache/beam/issues/22267
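
To make the question concrete, here is a rough sketch of standardizing the numerical columns without assigning through `.loc`, by computing the scaled columns separately and attaching them with `assign`. It is written against plain pandas with made-up sample values; whether a deferred Beam DataFrame accepts the same calls is an assumption that would need to be verified.

```python
import numpy as np
import pandas as pd

# Illustrative data; the values are invented for this example.
df = pd.DataFrame({
    "absolute_magnitude": [10.0, 12.0, 14.0],
    "eccentricity": [0.1, 0.2, 0.3],
    "object_class": ["MBA", "MBA", "APO"],
})

numerical_cols = df.select_dtypes(include=np.number).columns.tolist()

# Subtract the mean and divide by the (sample) standard deviation, per column.
numeric = df[numerical_cols]
standardized = (numeric - numeric.mean()) / numeric.std()

# Replace the numerical columns in a copy instead of writing through .loc.
result = df.assign(**{col: standardized[col] for col in numerical_cols})
print(result)
```

Because `assign` returns a new DataFrame, the categorical columns are carried along unchanged and no in-place item assignment is involved.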




[GitHub] [beam] davidcavazos commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
davidcavazos commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r992788022


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,2163 @@
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "source": [
+        "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License."
+      ],
+      "metadata": {
+        "id": "sARMhsXz8yR1",
+        "cellView": "form"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like\n",
+        "API to declare and define Beam processing pipelines. They provide a familiar interface for machine learning practitioners to build complex data-processing pipelines using only standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "Because we want to explore the elements within a `PCollection`, we use the Interactive runner by installing Apache Beam with the `interactive` component. The latest DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to test our code interactively, building out the pipeline as we go before deploying it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we work with only a subset of the original dataset, since we're using only the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam function returns a deferred Beam DataFrame, while the pandas function returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 5,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam import dataframe\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, full] where\n",
+        "# full contains all of the dataset (around 1000000 samples)\n",
+        "\n",
+        "source_csv_file = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(source_csv_file, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. Although a distance of 70,000 km might seem too large to harm us, at an astronomical scale it is very small and can disrupt many natural phenomena. \n",
+        "\n",
+        "These objects and asteroids can therefore prove harmful, so it is wise to know what surrounds us and which of these objects can harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types."
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 3,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "f5386993-14cb-42ee-94ca-8ea006860d3e"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 3
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a Pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 8,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 378
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "e49b4243-107f-4256-9e09-49cc20bf7f56"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_f6af66571c53daa0d9052370b7d1d8b7\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_f6af66571c53daa0d9052370b7d1d8b7\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_f6af66571c53daa0d9052370b7d1d8b7\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "             spk_id  absolute_magnitude     diameter       albedo  \\\n",
+              "count  9.999000e+03         9999.000000  8688.000000  8672.000000   \n",
+              "mean   2.005000e+06           12.675380    19.245446     0.197723   \n",
+              "std    2.886607e+03            1.639609    30.190191     0.138819   \n",
+              "min    2.000001e+06            3.000000     0.300000     0.008000   \n",
+              "25%    2.002500e+06           11.900000     5.614000     0.074000   \n",
+              "50%    2.005000e+06           12.900000     9.814000     0.187000   \n",
+              "75%    2.007500e+06           13.700000    19.156750     0.283000   \n",
+              "max    2.009999e+06           20.700000   939.400000     1.000000   \n",
+              "\n",
+              "       diameter_sigma  eccentricity  inclination      moid_ld  \\\n",
+              "count     8591.000000   9999.000000  9999.000000  9999.000000   \n",
+              "mean         0.454072      0.148716     7.890742   509.805237   \n",
+              "std          1.093676      0.083803     6.336244   205.046582   \n",
+              "min          0.006000      0.001003     0.042716     0.131028   \n",
+              "25%          0.120000      0.093780     3.220137   377.829197   \n",
+              "50%          0.201000      0.140335     6.018836   470.650523   \n",
+              "75%          0.375000      0.187092    10.918176   636.010802   \n",
+              "max         39.297000      0.889831    68.018875  4241.524913   \n",
+              "\n",
+              "       semi_major_axis_au_unit  \n",
+              "count              9999.000000  \n",
+              "mean                  2.689836  \n",
+              "std                   0.607190  \n",
+              "min                   0.832048  \n",
+              "25%                   2.340816  \n",
+              "50%                   2.614468  \n",
+              "75%                   3.005449  \n",
+              "max                  24.667968  "
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-2c0349a9-81c4-473a-9fa1-44c423244858\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>count</th>\n",
+              "      <td>9.999000e+03</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>8688.000000</td>\n",
+              "      <td>8672.000000</td>\n",
+              "      <td>8591.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>mean</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.675380</td>\n",
+              "      <td>19.245446</td>\n",
+              "      <td>0.197723</td>\n",
+              "      <td>0.454072</td>\n",
+              "      <td>0.148716</td>\n",
+              "      <td>7.890742</td>\n",
+              "      <td>509.805237</td>\n",
+              "      <td>2.689836</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>std</th>\n",
+              "      <td>2.886607e+03</td>\n",
+              "      <td>1.639609</td>\n",
+              "      <td>30.190191</td>\n",
+              "      <td>0.138819</td>\n",
+              "      <td>1.093676</td>\n",
+              "      <td>0.083803</td>\n",
+              "      <td>6.336244</td>\n",
+              "      <td>205.046582</td>\n",
+              "      <td>0.607190</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>min</th>\n",
+              "      <td>2.000001e+06</td>\n",
+              "      <td>3.000000</td>\n",
+              "      <td>0.300000</td>\n",
+              "      <td>0.008000</td>\n",
+              "      <td>0.006000</td>\n",
+              "      <td>0.001003</td>\n",
+              "      <td>0.042716</td>\n",
+              "      <td>0.131028</td>\n",
+              "      <td>0.832048</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>25%</th>\n",
+              "      <td>2.002500e+06</td>\n",
+              "      <td>11.900000</td>\n",
+              "      <td>5.614000</td>\n",
+              "      <td>0.074000</td>\n",
+              "      <td>0.120000</td>\n",
+              "      <td>0.093780</td>\n",
+              "      <td>3.220137</td>\n",
+              "      <td>377.829197</td>\n",
+              "      <td>2.340816</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>50%</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.900000</td>\n",
+              "      <td>9.814000</td>\n",
+              "      <td>0.187000</td>\n",
+              "      <td>0.201000</td>\n",
+              "      <td>0.140335</td>\n",
+              "      <td>6.018836</td>\n",
+              "      <td>470.650523</td>\n",
+              "      <td>2.614468</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>75%</th>\n",
+              "      <td>2.007500e+06</td>\n",
+              "      <td>13.700000</td>\n",
+              "      <td>19.156750</td>\n",
+              "      <td>0.283000</td>\n",
+              "      <td>0.375000</td>\n",
+              "      <td>0.187092</td>\n",
+              "      <td>10.918176</td>\n",
+              "      <td>636.010802</td>\n",
+              "      <td>3.005449</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>max</th>\n",
+              "      <td>2.009999e+06</td>\n",
+              "      <td>20.700000</td>\n",
+              "      <td>939.400000</td>\n",
+              "      <td>1.000000</td>\n",
+              "      <td>39.297000</td>\n",
+              "      <td>0.889831</td>\n",
+              "      <td>68.018875</td>\n",
+              "      <td>4241.524913</td>\n",
+              "      <td>24.667968</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-2c0349a9-81c4-473a-9fa1-44c423244858')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-2c0349a9-81c4-473a-9fa1-44c423244858 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-2c0349a9-81c4-473a-9fa1-44c423244858');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://developers.google.com/machine-learning/data-prep/transform/normalization) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://developers.google.com/machine-learning/data-prep/transform/transform-categorical) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can also use the standard pandas command `DataFrame.describe()` to generate descriptive statistics for the numerical columns, such as percentiles, mean, and standard deviation. "
+      ],
+      "metadata": {
+        "id": "MGAErO0lAYws"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "with dataframe.allow_non_parallel_operations():\n",
+        "  beam_df_description = ib.collect(beam_df.describe())\n",
+        "\n",
+        "beam_df_description"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 378
+        },
+        "id": "Befv697VBGM7",
+        "outputId": "d02b7a41-a8a3-4837-cf63-e1fa9e7b011e"
+      },
+      "execution_count": 9,
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a5b31481d153dff1b7ecdd673624949b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a5b31481d153dff1b7ecdd673624949b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a5b31481d153dff1b7ecdd673624949b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "             spk_id  absolute_magnitude     diameter       albedo  \\\n",
+              "count  9.999000e+03         9999.000000  8688.000000  8672.000000   \n",
+              "mean   2.005000e+06           12.675380    19.245446     0.197723   \n",
+              "std    2.886607e+03            1.639609    30.190191     0.138819   \n",
+              "min    2.000001e+06            3.000000     0.300000     0.008000   \n",
+              "25%    2.002500e+06           11.900000     5.614000     0.074000   \n",
+              "50%    2.005000e+06           12.900000     9.814000     0.187000   \n",
+              "75%    2.007500e+06           13.700000    19.156750     0.283000   \n",
+              "max    2.009999e+06           20.700000   939.400000     1.000000   \n",
+              "\n",
+              "       diameter_sigma  eccentricity  inclination      moid_ld  \\\n",
+              "count     8591.000000   9999.000000  9999.000000  9999.000000   \n",
+              "mean         0.454072      0.148716     7.890742   509.805237   \n",
+              "std          1.093676      0.083803     6.336244   205.046582   \n",
+              "min          0.006000      0.001003     0.042716     0.131028   \n",
+              "25%          0.120000      0.093780     3.220137   377.829197   \n",
+              "50%          0.201000      0.140335     6.018836   470.650523   \n",
+              "75%          0.375000      0.187092    10.918176   636.010802   \n",
+              "max         39.297000      0.889831    68.018875  4241.524913   \n",
+              "\n",
+              "       semi_major_axis_au_unit  \n",
+              "count              9999.000000  \n",
+              "mean                  2.689836  \n",
+              "std                   0.607190  \n",
+              "min                   0.832048  \n",
+              "25%                   2.340816  \n",
+              "50%                   2.614468  \n",
+              "75%                   3.005449  \n",
+              "max                  24.667968  "
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-d16cf806-a3e2-46d9-973d-74448570aaa2\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>count</th>\n",
+              "      <td>9.999000e+03</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>8688.000000</td>\n",
+              "      <td>8672.000000</td>\n",
+              "      <td>8591.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>mean</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.675380</td>\n",
+              "      <td>19.245446</td>\n",
+              "      <td>0.197723</td>\n",
+              "      <td>0.454072</td>\n",
+              "      <td>0.148716</td>\n",
+              "      <td>7.890742</td>\n",
+              "      <td>509.805237</td>\n",
+              "      <td>2.689836</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>std</th>\n",
+              "      <td>2.886607e+03</td>\n",
+              "      <td>1.639609</td>\n",
+              "      <td>30.190191</td>\n",
+              "      <td>0.138819</td>\n",
+              "      <td>1.093676</td>\n",
+              "      <td>0.083803</td>\n",
+              "      <td>6.336244</td>\n",
+              "      <td>205.046582</td>\n",
+              "      <td>0.607190</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>min</th>\n",
+              "      <td>2.000001e+06</td>\n",
+              "      <td>3.000000</td>\n",
+              "      <td>0.300000</td>\n",
+              "      <td>0.008000</td>\n",
+              "      <td>0.006000</td>\n",
+              "      <td>0.001003</td>\n",
+              "      <td>0.042716</td>\n",
+              "      <td>0.131028</td>\n",
+              "      <td>0.832048</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>25%</th>\n",
+              "      <td>2.002500e+06</td>\n",
+              "      <td>11.900000</td>\n",
+              "      <td>5.614000</td>\n",
+              "      <td>0.074000</td>\n",
+              "      <td>0.120000</td>\n",
+              "      <td>0.093780</td>\n",
+              "      <td>3.220137</td>\n",
+              "      <td>377.829197</td>\n",
+              "      <td>2.340816</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>50%</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.900000</td>\n",
+              "      <td>9.814000</td>\n",
+              "      <td>0.187000</td>\n",
+              "      <td>0.201000</td>\n",
+              "      <td>0.140335</td>\n",
+              "      <td>6.018836</td>\n",
+              "      <td>470.650523</td>\n",
+              "      <td>2.614468</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>75%</th>\n",
+              "      <td>2.007500e+06</td>\n",
+              "      <td>13.700000</td>\n",
+              "      <td>19.156750</td>\n",
+              "      <td>0.283000</td>\n",
+              "      <td>0.375000</td>\n",
+              "      <td>0.187092</td>\n",
+              "      <td>10.918176</td>\n",
+              "      <td>636.010802</td>\n",
+              "      <td>3.005449</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>max</th>\n",
+              "      <td>2.009999e+06</td>\n",
+              "      <td>20.700000</td>\n",
+              "      <td>939.400000</td>\n",
+              "      <td>1.000000</td>\n",
+              "      <td>39.297000</td>\n",
+              "      <td>0.889831</td>\n",
+              "      <td>68.018875</td>\n",
+              "      <td>4241.524913</td>\n",
+              "      <td>24.667968</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-d16cf806-a3e2-46d9-973d-74448570aaa2')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-d16cf806-a3e2-46d9-973d-74448570aaa2 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-d16cf806-a3e2-46d9-973d-74448570aaa2');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 9
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before applying any transformations, we need to check whether all the columns are needed for model training. Let's first look at the column descriptions provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the apparent magnitude the asteroid would have if it were 1 au from both the Sun and the observer, at zero phase angle.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km.\n",
+        "* **albedo:** the fraction of incident solar radiation that the object diffusely reflects, measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in the object diameter, in km.\n",
+        "* **eccentricity:** value between 0 and 1 that describes how circular (close to 0) or elongated (close to 1) the asteroid's orbit is.\n",
+        "* **inclination:** angle of the orbit with respect to the x-y ecliptic plane.\n",
+        "* **moid_ld:** Earth Minimum Orbit Intersection Distance, in lunar distances (LD).\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** half the length of the long axis of the orbit, in au.\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. They can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_02cba1067de8e024374a26297834d233\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most of the columns have no missing values. However, the columns **'diameter'**, **'albedo'** and **'diameter_sigma'** are each missing in roughly 13-14% of the rows. Since these values cannot be measured or derived, we remove these columns; they will not be used for training the machine learning model."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "Next, we need to normalize the numerical columns before using them to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation (a.k.a. [z-score](https://developers.google.com/machine-learning/data-prep/transform/normalization#z-score)). Standardizing the features this way typically improves the stability of training and the performance of the model.\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical columns and the categorical columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 356
+        },
+        "id": "Gjc0UlDD-xUn",
+        "outputId": "cadc4402-7edc-43f6-bc6a-bc4f1fd3314a"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "error",
+          "ename": "NotImplementedError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mNotImplementedError\u001b[0m                       Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/987840581.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Normalizing method_1: Can work but relies on ticket\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001
 b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frame_base.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    426\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    427\u001b[0m     raise NotImplementedError(\n\u001b[0;32m--> 428\u001b[0;31m         \u001b[0;34mf\"{op!r} is not implemented yet. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    429\u001b[0m         \u001b[0;34mf\"If support for {op!r} is important to you, please let the Beam \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    430\u001b[0m         \u001b[0;34m\"community know by writing to user@beam.apache.org \"\u001b[0m\u001b[0;34m\u
 001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mNotImplementedError\u001b[0m: 'loc.setitem' is not implemented yet. If support for 'loc.setitem' is important to you, please let the Beam community know by writing to user@beam.apache.org (see https://beam.apache.org/community/contact-us/) or commenting on https://github.com/apache/beam/issues/20318"
+          ]
+        }
+      ],
+      "source": [
+        "# Normalizing method_1: Can work but relies on ticket #22267\n",
+        "beam_df.loc[:, numerical_cols] = (beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean()) / beam_df.loc[:, numerical_cols].std()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
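+        "Normalizing the data: because assignment through `loc` is not implemented yet, we standardize a separate DataFrame that holds only the numerical columns."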
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "be40308f-d27c-46fa-e365-43a635addf6b"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>306</th>\n",
+              "      <td>-1.570727</td>\n",
+              "      <td>-0.062543</td>\n",
+              "      <td>-0.278518</td>\n",
+              "      <td>0.373194</td>\n",
+              "      <td>0.357201</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>310</th>\n",
+              "      <td>-1.631718</td>\n",
+              "      <td>-1.724526</td>\n",
+              "      <td>-0.736389</td>\n",
+              "      <td>1.087833</td>\n",
+              "      <td>0.344233</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>546</th>\n",
+              "      <td>-1.753698</td>\n",
+              "      <td>1.028793</td>\n",
+              "      <td>1.415303</td>\n",
+              "      <td>-0.339489</td>\n",
+              "      <td>0.139080</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>635</th>\n",
+              "      <td>-1.875678</td>\n",
+              "      <td>0.244869</td>\n",
+              "      <td>0.005905</td>\n",
+              "      <td>0.214107</td>\n",
+              "      <td>0.367559</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>701</th>\n",
+              "      <td>-3.278451</td>\n",
+              "      <td>-1.570523</td>\n",
+              "      <td>2.006145</td>\n",
+              "      <td>1.542754</td>\n",
+              "      <td>0.829337</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9697</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-1.151809</td>\n",
+              "      <td>-0.082944</td>\n",
+              "      <td>-0.129556</td>\n",
+              "      <td>-0.533538</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9813</th>\n",
+              "      <td>1.722740</td>\n",
+              "      <td>0.844551</td>\n",
+              "      <td>-0.583247</td>\n",
+              "      <td>-1.006447</td>\n",
+              "      <td>-0.677961</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9868</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-0.207399</td>\n",
+              "      <td>-0.784665</td>\n",
+              "      <td>-0.462136</td>\n",
+              "      <td>-0.539794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9903</th>\n",
+              "      <td>0.868878</td>\n",
+              "      <td>0.460086</td>\n",
+              "      <td>0.092258</td>\n",
+              "      <td>-0.107597</td>\n",
+              "      <td>0.071794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9956</th>\n",
+              "      <td>0.746898</td>\n",
+              "      <td>-0.234132</td>\n",
+              "      <td>-0.161116</td>\n",
+              "      <td>-0.601379</td>\n",
+              "      <td>-0.664887</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 5 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_numerical_cols = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize the DataFrame that contains only the numerical columns\n",
+        "beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean())/beam_numerical_cols.std()\n",
+        "\n",
+        "ib.collect(beam_numerical_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to convert the categorical columns into one-hot encoded variables to use them during training.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "Ngoxg0rSywVd",
+        "outputId": "d81bb29a-f8f8-4186-b186-5ff85667dbec"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1671644751.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'object_class'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'get_dummies'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col = beam_df.filter(items=['object_class'])\n",
+        "object_class_col.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "hz8s7z8caTq-",
+        "outputId": "5c543a6b-0ea1-41f8-afec-9691bbbd1f5b"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1927971370.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;31m# df['categories_concat'].str.get_dummies('-')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'str'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col.str.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "rVdSIyCB0spw"
+      },
+      "source": [
+        "# Putting it all together\n",
+        "\n",
+        "Let's now combine all the steps above into a full pipeline implementation and visualize the preprocessed data.\n",
+        "\n",
+        "> ℹ️ Note that the only standard Beam construct used here is the `pipeline` instance. The rest of the preprocessing commands are native pandas methods that have been integrated into the Beam DataFrame API."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "ndaSNond0v8Q",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 498
+        },
+        "outputId": "0155d359-45c9-4345-e1b6-b1881408f049"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "error",
+          "ename": "ValueError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1408061827.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m     25\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     26\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 27\u001b[0;31m \u001b[0mib\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcollect\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpreprocessed_dataset\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/utils.py\u001b[0m in \u001b[0;36mrun_within_progress_indicator\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    275\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mrun_within_progress_indicator\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    276\u001b[0m     \u001b[0;32mwith\u001b[0m \u001b[0mProgressIndicator\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34mf'Processing... {func.__name__}'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Done.'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 277\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001
 b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    278\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    279\u001b[0m   \u001b[0;32mreturn\u001b[0m \u001b[0mrun_within_progress_indicator\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/interactive_beam.py\u001b[0m in \u001b[0;36mcollect\u001b[0;34m(pcoll, n, duration, include_window_info)\u001b[0m\n\u001b[1;32m    945\u001b[0m         element_type=element_type)\n\u001b[1;32m    946\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 947\u001b[0;31m   \u001b[0mrecording\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mrecording_manager\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrecord\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mpcoll\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmax_n\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mn\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmax_duration\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mduration\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    948\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    949\u001b[0m   \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0
 ;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/recording_manager.py\u001b[0m in \u001b[0;36mrecord\u001b[0;34m(self, pcolls, max_n, max_duration)\u001b[0m\n\u001b[1;32m    459\u001b[0m       pf.PipelineFragment(\n\u001b[1;32m    460\u001b[0m           \u001b[0mlist\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0muncomputed_pcolls\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 461\u001b[0;31m           self.user_pipeline.options).run(blocking=is_remote_run)\n\u001b[0m\u001b[1;32m    462\u001b[0m       \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mie\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcurrent_env\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpipeline_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0muser_pipeline\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n
 \u001b[1;32m    463\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py\u001b[0m in \u001b[0;36mrun\u001b[0;34m(self, display_pipeline_graph, use_cache, blocking)\u001b[0m\n\u001b[1;32m    111\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_force_compute\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0muse_cache\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    112\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_blocking\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mblocking\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 113\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdeduce_fragment\u001
 b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    114\u001b[0m     \u001b[0;32mfinally\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    115\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_skip_display\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpreserved_skip_display\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py\u001b[0m in \u001b[0;36mdeduce_fragment\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m     98\u001b[0m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mto_runner_api\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     99\u001b[0m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 100\u001b[0;31m         self._options)\n\u001b[0m\u001b[1;32m    101\u001b[0m     \u001b[0mie\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcurrent_env\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0madd_derived_pipeline\u001b[0m\u001b[0;34m(\u001b[0m\u001b[
 0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfragment\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    102\u001b[0m     \u001b[0;32mreturn\u001b[0m \u001b[0mfragment\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/pipeline.py\u001b[0m in \u001b[0;36mfrom_runner_api\u001b[0;34m(proto, runner, options, return_context)\u001b[0m\n\u001b[1;32m    990\u001b[0m       \u001b[0mpcollection\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpipeline\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mp\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    991\u001b[0m       \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mpcollection\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mproducer\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 992\u001b[0;31m         \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'No producer for %s'\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0mid\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    993\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    994\u001b[0m 
     \u001b[0;31m# Inject PBegin input where necessary.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mValueError\u001b[0m: No producer for ref_PCollection_PCollection_265"
+          ]
+        }
+      ],
+      "source": [
+        "# Specify the location of source csv file to be processed\n",
+        "source_csv_file = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(source_csv_file, splittable=True)\n",
+        "\n",
+        "# Drop irrelevant columns/columns with missing values\n",
+        "beam_df = beam_df.drop(['spk_id', 'full_name','diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "\n",
+        "# Get numerical columns/columns with categorical variables\n",
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))\n",
+        "\n",
+        "# Normalize the numerical variables \n",
+        "beam_df_numerical = beam_df.filter(items=numerical_cols)\n",
+        "beam_df_numerical = (beam_df_numerical - beam_df_numerical.mean())/beam_df_numerical.std()\n",
+        "\n",
+        "\n",
+        "# One-hot encode the categorical variables \n",
+        "beam_df_categorical = beam_df.filter(items=categorical_cols)\n",
+        "# TODO: one-hot encoding step\n",
+        "\n",
+        "# Merge the normalized variables with the one-hot encoded variables\n",
+        "preprocessed_dataset = beam_df_categorical.merge(beam_df_numerical, left_index = True, right_index = True)\n",
+        "\n",
+        "ib.collect(preprocessed_dataset)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xZvJTqa3XKI_"
+      },
+      "source": [
+        "# Part II : Process the full dataset with the Distributed Runner\n",
+        "Now that we've showcased how to build and execute the pipeline locally using the Interactive Runner, it's time to execute our pipeline on our full dataset by switching to a distributed runner. For this example, we will execute our pipeline on [Dataflow](https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline)."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "PROJECT_ID = \"<my-gcp-project>\"\n",
+        "REGION = \"us-west1\"\n",
+        "TEMP_DIR = \"gs://<my-bucket>/tmp\"\n",
+        "OUTPUT_DIR = \"gs://<my-bucket>/dataframe-result\""
+      ],
+      "metadata": {
+        "id": "dDBYbMEWbL4t"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "> ℹ️ Note that we are now processing the full dataset `full.csv`, which contains approximately 1 million rows. We're also writing the results to a `csv` file instead of using `ib.collect()` to materialize the deferred dataframe.\n",
+        "\n",
+        "> ℹ️ The only things we need to change to switch from an interactive runner to a distributed one are the pipeline options. The rest of the pipeline steps are identical."
+      ],
+      "metadata": {
+        "id": "Qk1GaYoSc9-1"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Specify the location of source csv file to be processed (full dataset)\n",
+        "source_csv_file = 'gs://apache-beam-samples/nasa_jpl_asteroid/full.csv'\n",
+        "\n",
+        "# Build a new pipeline that will execute on Dataflow.\n",
+        "p = beam.Pipeline(DataflowRunner(),\n",
+        "                  options=beam.options.pipeline_options.PipelineOptions(\n",
+        "                      project=PROJECT_ID,\n",
+        "                      region=REGION,\n",
+        "                      temp_location=TEMP_DIR,\n",
+        "                      # Disable autoscaling for a quicker demo\n",
+        "                      autoscaling_algorithm='NONE',\n",
+        "                      num_workers=10))\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(source_csv_file, splittable=True)\n",
+        "\n",
+        "# Drop irrelevant columns/columns with missing values\n",
+        "beam_df = beam_df.drop(['spk_id', 'full_name','diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "\n",
+        "# Get numerical columns/columns with categorical variables\n",
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))\n",
+        "\n",
+        "# Normalize the numerical variables \n",
+        "beam_df_numerical = beam_df.filter(items=numerical_cols)\n",
+        "beam_df_numerical = (beam_df_numerical - beam_df_numerical.mean())/beam_df_numerical.std()\n",
+        "\n",
+        "# One-hot encode the categorical variables \n",
+        "beam_df_categorical = beam_df.filter(items=categorical_cols)\n",
+        "# TODO: one-hot encoding step\n",
+        "\n",
+        "# Merge the normalized variables with the one-hot encoded variables (optional)\n",
+        "preprocessed_dataset = beam_df_categorical.merge(beam_df_numerical, left_index = True, right_index = True)\n",
+        "\n",
+        "# Write the pre-processed dataset to csv\n",
+        "preprocessed_dataset.to_csv(os.path.join(OUTPUT_DIR, \"preprocessed_data.csv\"))"
+      ],
+      "metadata": {
+        "id": "1XovR0gKbMlK"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Let's now submit and execute our pipeline."
+      ],
+      "metadata": {
+        "id": "a789u4Yecs_g"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "p.run().wait_until_finish()"
+      ],
+      "metadata": {
+        "id": "pbUlC102bPaZ"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "The execution of the pipeline job will take some time until it finishes."
+      ],
+      "metadata": {
+        "id": "dzdqmzKzTOng"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# What's next \n",
+        "\n",
+        "Now that we've seen how to analyze and preprocess a large-scale dataset with the Beam DataFrames API, we can train a classification model on our preprocessed dataset.  \n",
+        "\n",
+        "To learn more about how to get started with classifying structured data, refer to:\n",
+        "\n",
+        "*   [Classify structured data with feature columns](https://www.tensorflow.org/tutorials/structured_data/feature_columns)\n",

Review Comment:
   Feature columns are no longer recommended. They were used with tf.Estimators, but now that TF2 has moved to Keras, the preprocessing layers are recommended instead.
   
   Also, I noticed that the preprocessing layers already take care of z-score normalization and one-hot encoding, but it's still nice to see how to do that with Beam.
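
   For reference, here is a minimal sketch of what those two transformations compute. It is plain standard-library Python rather than Beam or Keras code, the helper names `zscore` and `one_hot` and the toy values are made up for illustration, and it uses the sample standard deviation because that is the pandas `std()` default that the notebook's `(df - df.mean()) / df.std()` relies on.

```python
import statistics


def zscore(values):
    """Standardize values to zero mean and unit sample standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample std (ddof=1), like pandas' default
    return [(v - mean) / std for v in values]


def one_hot(values):
    """Return (categories, rows) with one 0/1 indicator per sorted category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical toy data, not taken from the asteroid dataset.
inclination = [2.0, 4.0, 6.0]
object_class = ["MBA", "AMO", "MBA"]

scaled = zscore(inclination)
categories, encoded = one_hot(object_class)
```

   Note that the category list has to be known before any row can be encoded, so the output columns depend on the data itself; that is presumably part of why the one-hot step is still a TODO in the deferred DataFrame version.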





[GitHub] [beam] PhilippeMoussalli commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
PhilippeMoussalli commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r964470743


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. It provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by invoking only standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"

Review Comment:
   Good point :)





[GitHub] [beam] PhilippeMoussalli commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
PhilippeMoussalli commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r1005704412


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,2163 @@
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "source": [
+        "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License."
+      ],
+      "metadata": {
+        "id": "sARMhsXz8yR1",
+        "cellView": "form"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like\n",
+        "API to declare and define Beam processing pipelines. It provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by invoking only standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "As we want to explore the elements within a `PCollection`, we can make use of the Interactive runner by installing Apache Beam with the `interactive` component. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to test our code interactively, building out the pipeline as we go before deploying it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we will only be working with a subset of the original dataset since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam method returns a deferred Beam DataFrame while pandas returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 5,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam import dataframe\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, full] where\n",
+        "# full contains all of the dataset (around 1000000 samples)\n",
+        "\n",
+        "source_csv_file = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(source_csv_file, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. A distance of 70,000 km might not seem potentially harmful, but at an astronomical scale this is a very small distance and can disrupt many natural phenomena. \n",
+        "\n",
+        "These objects/asteroids can thus prove to be harmful. Hence, it is wise to know what surrounds us and which of these objects can harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types."
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 3,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "f5386993-14cb-42ee-94ca-8ea006860d3e"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 3
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a Pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 8,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 378
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "e49b4243-107f-4256-9e09-49cc20bf7f56"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_f6af66571c53daa0d9052370b7d1d8b7\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_f6af66571c53daa0d9052370b7d1d8b7\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_f6af66571c53daa0d9052370b7d1d8b7\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "             spk_id  absolute_magnitude     diameter       albedo  \\\n",
+              "count  9.999000e+03         9999.000000  8688.000000  8672.000000   \n",
+              "mean   2.005000e+06           12.675380    19.245446     0.197723   \n",
+              "std    2.886607e+03            1.639609    30.190191     0.138819   \n",
+              "min    2.000001e+06            3.000000     0.300000     0.008000   \n",
+              "25%    2.002500e+06           11.900000     5.614000     0.074000   \n",
+              "50%    2.005000e+06           12.900000     9.814000     0.187000   \n",
+              "75%    2.007500e+06           13.700000    19.156750     0.283000   \n",
+              "max    2.009999e+06           20.700000   939.400000     1.000000   \n",
+              "\n",
+              "       diameter_sigma  eccentricity  inclination      moid_ld  \\\n",
+              "count     8591.000000   9999.000000  9999.000000  9999.000000   \n",
+              "mean         0.454072      0.148716     7.890742   509.805237   \n",
+              "std          1.093676      0.083803     6.336244   205.046582   \n",
+              "min          0.006000      0.001003     0.042716     0.131028   \n",
+              "25%          0.120000      0.093780     3.220137   377.829197   \n",
+              "50%          0.201000      0.140335     6.018836   470.650523   \n",
+              "75%          0.375000      0.187092    10.918176   636.010802   \n",
+              "max         39.297000      0.889831    68.018875  4241.524913   \n",
+              "\n",
+              "       semi_major_axis_au_unit  \n",
+              "count              9999.000000  \n",
+              "mean                  2.689836  \n",
+              "std                   0.607190  \n",
+              "min                   0.832048  \n",
+              "25%                   2.340816  \n",
+              "50%                   2.614468  \n",
+              "75%                   3.005449  \n",
+              "max                  24.667968  "
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-2c0349a9-81c4-473a-9fa1-44c423244858\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>count</th>\n",
+              "      <td>9.999000e+03</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>8688.000000</td>\n",
+              "      <td>8672.000000</td>\n",
+              "      <td>8591.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>mean</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.675380</td>\n",
+              "      <td>19.245446</td>\n",
+              "      <td>0.197723</td>\n",
+              "      <td>0.454072</td>\n",
+              "      <td>0.148716</td>\n",
+              "      <td>7.890742</td>\n",
+              "      <td>509.805237</td>\n",
+              "      <td>2.689836</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>std</th>\n",
+              "      <td>2.886607e+03</td>\n",
+              "      <td>1.639609</td>\n",
+              "      <td>30.190191</td>\n",
+              "      <td>0.138819</td>\n",
+              "      <td>1.093676</td>\n",
+              "      <td>0.083803</td>\n",
+              "      <td>6.336244</td>\n",
+              "      <td>205.046582</td>\n",
+              "      <td>0.607190</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>min</th>\n",
+              "      <td>2.000001e+06</td>\n",
+              "      <td>3.000000</td>\n",
+              "      <td>0.300000</td>\n",
+              "      <td>0.008000</td>\n",
+              "      <td>0.006000</td>\n",
+              "      <td>0.001003</td>\n",
+              "      <td>0.042716</td>\n",
+              "      <td>0.131028</td>\n",
+              "      <td>0.832048</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>25%</th>\n",
+              "      <td>2.002500e+06</td>\n",
+              "      <td>11.900000</td>\n",
+              "      <td>5.614000</td>\n",
+              "      <td>0.074000</td>\n",
+              "      <td>0.120000</td>\n",
+              "      <td>0.093780</td>\n",
+              "      <td>3.220137</td>\n",
+              "      <td>377.829197</td>\n",
+              "      <td>2.340816</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>50%</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.900000</td>\n",
+              "      <td>9.814000</td>\n",
+              "      <td>0.187000</td>\n",
+              "      <td>0.201000</td>\n",
+              "      <td>0.140335</td>\n",
+              "      <td>6.018836</td>\n",
+              "      <td>470.650523</td>\n",
+              "      <td>2.614468</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>75%</th>\n",
+              "      <td>2.007500e+06</td>\n",
+              "      <td>13.700000</td>\n",
+              "      <td>19.156750</td>\n",
+              "      <td>0.283000</td>\n",
+              "      <td>0.375000</td>\n",
+              "      <td>0.187092</td>\n",
+              "      <td>10.918176</td>\n",
+              "      <td>636.010802</td>\n",
+              "      <td>3.005449</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>max</th>\n",
+              "      <td>2.009999e+06</td>\n",
+              "      <td>20.700000</td>\n",
+              "      <td>939.400000</td>\n",
+              "      <td>1.000000</td>\n",
+              "      <td>39.297000</td>\n",
+              "      <td>0.889831</td>\n",
+              "      <td>68.018875</td>\n",
+              "      <td>4241.524913</td>\n",
+              "      <td>24.667968</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-2c0349a9-81c4-473a-9fa1-44c423244858')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-2c0349a9-81c4-473a-9fa1-44c423244858 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-2c0349a9-81c4-473a-9fa1-44c423244858');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://developers.google.com/machine-learning/data-prep/transform/normalization) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://developers.google.com/machine-learning/data-prep/transform/transform-categorical) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can also use the standard pandas command `DataFrame.describe()` to generate descriptive statistics for the numerical columns, such as percentiles, mean, and standard deviation."
+      ],
+      "metadata": {
+        "id": "MGAErO0lAYws"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "with dataframe.allow_non_parallel_operations():\n",
+        "  beam_df_description = ib.collect(beam_df.describe())\n",
+        "\n",
+        "beam_df_description"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 378
+        },
+        "id": "Befv697VBGM7",
+        "outputId": "d02b7a41-a8a3-4837-cf63-e1fa9e7b011e"
+      },
+      "execution_count": 9,
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a5b31481d153dff1b7ecdd673624949b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a5b31481d153dff1b7ecdd673624949b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a5b31481d153dff1b7ecdd673624949b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "             spk_id  absolute_magnitude     diameter       albedo  \\\n",
+              "count  9.999000e+03         9999.000000  8688.000000  8672.000000   \n",
+              "mean   2.005000e+06           12.675380    19.245446     0.197723   \n",
+              "std    2.886607e+03            1.639609    30.190191     0.138819   \n",
+              "min    2.000001e+06            3.000000     0.300000     0.008000   \n",
+              "25%    2.002500e+06           11.900000     5.614000     0.074000   \n",
+              "50%    2.005000e+06           12.900000     9.814000     0.187000   \n",
+              "75%    2.007500e+06           13.700000    19.156750     0.283000   \n",
+              "max    2.009999e+06           20.700000   939.400000     1.000000   \n",
+              "\n",
+              "       diameter_sigma  eccentricity  inclination      moid_ld  \\\n",
+              "count     8591.000000   9999.000000  9999.000000  9999.000000   \n",
+              "mean         0.454072      0.148716     7.890742   509.805237   \n",
+              "std          1.093676      0.083803     6.336244   205.046582   \n",
+              "min          0.006000      0.001003     0.042716     0.131028   \n",
+              "25%          0.120000      0.093780     3.220137   377.829197   \n",
+              "50%          0.201000      0.140335     6.018836   470.650523   \n",
+              "75%          0.375000      0.187092    10.918176   636.010802   \n",
+              "max         39.297000      0.889831    68.018875  4241.524913   \n",
+              "\n",
+              "       semi_major_axis_au_unit  \n",
+              "count              9999.000000  \n",
+              "mean                  2.689836  \n",
+              "std                   0.607190  \n",
+              "min                   0.832048  \n",
+              "25%                   2.340816  \n",
+              "50%                   2.614468  \n",
+              "75%                   3.005449  \n",
+              "max                  24.667968  "
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-d16cf806-a3e2-46d9-973d-74448570aaa2\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>count</th>\n",
+              "      <td>9.999000e+03</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>8688.000000</td>\n",
+              "      <td>8672.000000</td>\n",
+              "      <td>8591.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>mean</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.675380</td>\n",
+              "      <td>19.245446</td>\n",
+              "      <td>0.197723</td>\n",
+              "      <td>0.454072</td>\n",
+              "      <td>0.148716</td>\n",
+              "      <td>7.890742</td>\n",
+              "      <td>509.805237</td>\n",
+              "      <td>2.689836</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>std</th>\n",
+              "      <td>2.886607e+03</td>\n",
+              "      <td>1.639609</td>\n",
+              "      <td>30.190191</td>\n",
+              "      <td>0.138819</td>\n",
+              "      <td>1.093676</td>\n",
+              "      <td>0.083803</td>\n",
+              "      <td>6.336244</td>\n",
+              "      <td>205.046582</td>\n",
+              "      <td>0.607190</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>min</th>\n",
+              "      <td>2.000001e+06</td>\n",
+              "      <td>3.000000</td>\n",
+              "      <td>0.300000</td>\n",
+              "      <td>0.008000</td>\n",
+              "      <td>0.006000</td>\n",
+              "      <td>0.001003</td>\n",
+              "      <td>0.042716</td>\n",
+              "      <td>0.131028</td>\n",
+              "      <td>0.832048</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>25%</th>\n",
+              "      <td>2.002500e+06</td>\n",
+              "      <td>11.900000</td>\n",
+              "      <td>5.614000</td>\n",
+              "      <td>0.074000</td>\n",
+              "      <td>0.120000</td>\n",
+              "      <td>0.093780</td>\n",
+              "      <td>3.220137</td>\n",
+              "      <td>377.829197</td>\n",
+              "      <td>2.340816</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>50%</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.900000</td>\n",
+              "      <td>9.814000</td>\n",
+              "      <td>0.187000</td>\n",
+              "      <td>0.201000</td>\n",
+              "      <td>0.140335</td>\n",
+              "      <td>6.018836</td>\n",
+              "      <td>470.650523</td>\n",
+              "      <td>2.614468</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>75%</th>\n",
+              "      <td>2.007500e+06</td>\n",
+              "      <td>13.700000</td>\n",
+              "      <td>19.156750</td>\n",
+              "      <td>0.283000</td>\n",
+              "      <td>0.375000</td>\n",
+              "      <td>0.187092</td>\n",
+              "      <td>10.918176</td>\n",
+              "      <td>636.010802</td>\n",
+              "      <td>3.005449</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>max</th>\n",
+              "      <td>2.009999e+06</td>\n",
+              "      <td>20.700000</td>\n",
+              "      <td>939.400000</td>\n",
+              "      <td>1.000000</td>\n",
+              "      <td>39.297000</td>\n",
+              "      <td>0.889831</td>\n",
+              "      <td>68.018875</td>\n",
+              "      <td>4241.524913</td>\n",
+              "      <td>24.667968</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-d16cf806-a3e2-46d9-973d-74448570aaa2')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-d16cf806-a3e2-46d9-973d-74448570aaa2 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-d16cf806-a3e2-46d9-973d-74448570aaa2');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 9
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check whether all the columns are needed for model training. Let's first have a look at the column descriptions provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the visual magnitude the object would have if it were located 1 au from both the Sun and the observer, at zero phase angle.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km\n",
+        "* **albedo:** the fraction of incoming solar radiation that the object diffusely reflects, measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in object diameter, in km\n",
+        "* **eccentricity:** a value between 0 and 1 that describes how much the asteroid's orbit deviates from a perfect circle\n",
+        "* **inclination:** angle with respect to x-y ecliptic plane\n",
+        "* **moid_ld:** Earth Minimum Orbit Intersection Distance, in lunar distance (LD) units\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** the length of half of the long axis of the orbit, in au\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. These columns can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's have a look at the number of missing values."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_02cba1067de8e024374a26297834d233\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most of the columns have no missing values. However, the columns **'diameter'**, **'albedo'** and **'diameter_sigma'** each have roughly 13-14% missing values. Since these values cannot be measured or derived, we drop these columns; they are not required for training the machine learning model."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "Next, we need to normalize the numerical columns before using them to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation (also known as the [z-score](https://developers.google.com/machine-learning/data-prep/transform/normalization#z-score)). This improves the performance and stability of the model during training and inference.\n"

Review Comment:
   woops nice catch ;)
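
For reference, a minimal sketch of the z-score step that the cell above describes, written in plain Python rather than with the Beam DataFrame API. The helper name `z_score` is hypothetical, and the sample values are the first five `absolute_magnitude` entries from the output shown earlier in the diff. It assumes the sample standard deviation (n - 1 denominator), which is also the default for pandas' `std()`.

```python
import statistics


def z_score(values):
    """Standardize values: subtract the mean, divide by the standard deviation."""
    mean = statistics.mean(values)
    # statistics.stdev is the sample standard deviation (n - 1 denominator),
    # matching the default behavior of pandas' Series.std().
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


# First five absolute_magnitude values from the notebook output (illustrative only).
magnitudes = [3.40, 4.20, 5.33, 3.00, 6.90]
normalized = z_score(magnitudes)
print([round(v, 3) for v in normalized])
```

After this transformation the column has mean 0 and standard deviation 1, so columns measured on very different scales become comparable.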





[GitHub] [beam] davidcavazos commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
davidcavazos commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r1012106374


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,2163 @@
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "source": [
+        "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License."
+      ],
+      "metadata": {
+        "id": "sARMhsXz8yR1",
+        "cellView": "form"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like\n",
+        "API to declare and define Beam processing pipelines. This API gives machine learning practitioners a familiar interface for building complex data-processing pipelines using only standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "Because we want to explore the elements within a `PCollection`, we use the Interactive runner, which requires installing Apache Beam with the `interactive` component. The most recently implemented DataFrames API methods used in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to test our code interactively, building out the pipeline as we go before deploying it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we work with only a subset of the original dataset, since we're using only the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam function returns a deferred Beam DataFrame, while the pandas function returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 5,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam import dataframe\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, full] where\n",
+        "# full contains all of the dataset (around 1000000 samples)\n",
+        "\n",
+        "source_csv_file = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(source_csv_file, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are countless objects in outer space, and some of them are closer than we think. A distance of 70,000 km might not seem potentially harmful, but at an astronomical scale it is very small and can disrupt many natural phenomena.\n",
+        "\n",
+        "These objects/asteroids can thus prove to be harmful, so it is wise to know what surrounds us and which of those objects could harm us. This dataset compiles the list of NASA-certified asteroids that are classified as the nearest earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types"
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 3,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "f5386993-14cb-42ee-94ca-8ea006860d3e"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 3
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 8,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 378
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "e49b4243-107f-4256-9e09-49cc20bf7f56"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "             spk_id  absolute_magnitude     diameter       albedo  \\\n",
+              "count  9.999000e+03         9999.000000  8688.000000  8672.000000   \n",
+              "mean   2.005000e+06           12.675380    19.245446     0.197723   \n",
+              "std    2.886607e+03            1.639609    30.190191     0.138819   \n",
+              "min    2.000001e+06            3.000000     0.300000     0.008000   \n",
+              "25%    2.002500e+06           11.900000     5.614000     0.074000   \n",
+              "50%    2.005000e+06           12.900000     9.814000     0.187000   \n",
+              "75%    2.007500e+06           13.700000    19.156750     0.283000   \n",
+              "max    2.009999e+06           20.700000   939.400000     1.000000   \n",
+              "\n",
+              "       diameter_sigma  eccentricity  inclination      moid_ld  \\\n",
+              "count     8591.000000   9999.000000  9999.000000  9999.000000   \n",
+              "mean         0.454072      0.148716     7.890742   509.805237   \n",
+              "std          1.093676      0.083803     6.336244   205.046582   \n",
+              "min          0.006000      0.001003     0.042716     0.131028   \n",
+              "25%          0.120000      0.093780     3.220137   377.829197   \n",
+              "50%          0.201000      0.140335     6.018836   470.650523   \n",
+              "75%          0.375000      0.187092    10.918176   636.010802   \n",
+              "max         39.297000      0.889831    68.018875  4241.524913   \n",
+              "\n",
+              "       semi_major_axis_au_unit  \n",
+              "count              9999.000000  \n",
+              "mean                  2.689836  \n",
+              "std                   0.607190  \n",
+              "min                   0.832048  \n",
+              "25%                   2.340816  \n",
+              "50%                   2.614468  \n",
+              "75%                   3.005449  \n",
+              "max                  24.667968  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://developers.google.com/machine-learning/data-prep/transform/normalization) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://developers.google.com/machine-learning/data-prep/transform/transform-categorical) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can also use the standard pandas command `DataFrame.describe()` to generate descriptive statistics for the numerical columns, such as percentiles, mean, and standard deviation."
+      ],
+      "metadata": {
+        "id": "MGAErO0lAYws"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "with dataframe.allow_non_parallel_operations():\n",
+        "  beam_df_description = ib.collect(beam_df.describe())\n",
+        "\n",
+        "beam_df_description"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 378
+        },
+        "id": "Befv697VBGM7",
+        "outputId": "d02b7a41-a8a3-4837-cf63-e1fa9e7b011e"
+      },
+      "execution_count": 9,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "             spk_id  absolute_magnitude     diameter       albedo  \\\n",
+              "count  9.999000e+03         9999.000000  8688.000000  8672.000000   \n",
+              "mean   2.005000e+06           12.675380    19.245446     0.197723   \n",
+              "std    2.886607e+03            1.639609    30.190191     0.138819   \n",
+              "min    2.000001e+06            3.000000     0.300000     0.008000   \n",
+              "25%    2.002500e+06           11.900000     5.614000     0.074000   \n",
+              "50%    2.005000e+06           12.900000     9.814000     0.187000   \n",
+              "75%    2.007500e+06           13.700000    19.156750     0.283000   \n",
+              "max    2.009999e+06           20.700000   939.400000     1.000000   \n",
+              "\n",
+              "       diameter_sigma  eccentricity  inclination      moid_ld  \\\n",
+              "count     8591.000000   9999.000000  9999.000000  9999.000000   \n",
+              "mean         0.454072      0.148716     7.890742   509.805237   \n",
+              "std          1.093676      0.083803     6.336244   205.046582   \n",
+              "min          0.006000      0.001003     0.042716     0.131028   \n",
+              "25%          0.120000      0.093780     3.220137   377.829197   \n",
+              "50%          0.201000      0.140335     6.018836   470.650523   \n",
+              "75%          0.375000      0.187092    10.918176   636.010802   \n",
+              "max         39.297000      0.889831    68.018875  4241.524913   \n",
+              "\n",
+              "       semi_major_axis_au_unit  \n",
+              "count              9999.000000  \n",
+              "mean                  2.689836  \n",
+              "std                   0.607190  \n",
+              "min                   0.832048  \n",
+              "25%                   2.340816  \n",
+              "50%                   2.614468  \n",
+              "75%                   3.005449  \n",
+              "max                  24.667968  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 9
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before applying any transformations, we should check whether every column is needed for model training. Let's first look at the column descriptions provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the apparent magnitude an object would have if it were located at a distance of 10 parsecs.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km\n",
+        "* **albedo:** a measure of the diffuse reflection of solar radiation out of the total solar radiation, measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in object diameter, in km\n",
+        "* **eccentricity:** value between 0 and 1 that refers to how flat or round the asteroid's orbit is\n",
+        "* **inclination:** angle of the orbit with respect to the x-y ecliptic plane\n",
+        "* **moid_ld:** Earth minimum orbit intersection distance, in lunar distances (LD)\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** half the length of the long axis of the orbit, in au\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. These columns can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_02cba1067de8e024374a26297834d233\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most of the columns have no missing values. However, roughly 13% to 14% of the values in columns **'diameter'**, **'albedo'** and **'diameter_sigma'** are missing. Since these values cannot be measured or derived, we remove these columns; they are not required for training the machine learning model."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "Next, we need to normalize the numerical columns before using them to train a model. A common method of standarization is to subtract the mean and divide by standard deviation (a.k.a [z-score](https://developers.google.com/machine-learning/data-prep/transform/normalization#z-score)). This improves the performance and trainign stability of the model during training and inference.\n"

Review Comment:
   Typo: trainign -> training (still present)
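
   For reference, the z-score normalization described in the quoted cell (subtract the mean, divide by the standard deviation) can be sketched as follows. This is a minimal standard-library illustration, not the notebook's own code: the helper name `z_score` is hypothetical, and the sample values are the first five `absolute_magnitude` entries from the output shown earlier in the diff.

```python
import statistics


def z_score(values):
    """Standardize values to zero mean and unit standard deviation."""
    mean = statistics.fmean(values)
    # Sample standard deviation (ddof=1), which is also the pandas default
    # for Series.std(); the pandas analogue is (s - s.mean()) / s.std().
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


# First five 'absolute_magnitude' values from the notebook output above.
absolute_magnitude = [3.40, 4.20, 5.33, 3.00, 6.90]
normalized = z_score(absolute_magnitude)
```

   After this transformation the column has mean 0 and standard deviation 1, so features measured on very different scales become comparable.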





[GitHub] [beam] TheNeuralBit commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r1012096504


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,3517 @@
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "source": [
+        "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License."
+      ],
+      "metadata": {
+        "id": "sARMhsXz8yR1",
+        "cellView": "form"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like\n",
+        "API to declare and define Beam processing pipelines. They give machine learning practitioners a familiar interface for building complex data-processing pipelines using only standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Goal\n",
+        "The goal of this notebook is to explore a dataset and preprocess it for machine learning model training using the Beam DataFrames API.\n",
+        "\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "As we want to explore the elements within a `PCollection`, we can make use of the Interactive runner by installing Apache Beam with the `interactive` component. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [

Review Comment:
   It would be nice to add a TODO in here as well, but I wouldn't want it to be visible to the user. I'm not sure if there's a way to do that in the ipynb format.
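
   One possible approach, offered as a sketch rather than an established Beam convention: the notebook format lets each cell carry arbitrary JSON in its `metadata` object, so a note could be stored there instead of in the cell source. The key name `todo`, the helper `add_hidden_todo`, and the note text below are hypothetical, and the one-cell notebook dict only stands in for the real file.

```python
import json

# A minimal one-cell notebook standing in for the real .ipynb file.
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {"id": "-OJC0Xn5Um-C"},
            "outputs": [],
            "source": [],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 0,
}


def add_hidden_todo(nb, cell_id, note):
    """Store a note in the metadata of the cell whose metadata 'id' matches."""
    for cell in nb["cells"]:
        if cell["metadata"].get("id") == cell_id:
            # 'todo' is a hypothetical key; cell metadata accepts any JSON value.
            cell["metadata"]["todo"] = note
            return True
    return False


found = add_hidden_todo(
    notebook,
    "-OJC0Xn5Um-C",
    "Revisit this cell once Beam 2.43.0 is released.",
)
serialized = json.dumps(notebook, indent=2)
```

   Notebook front ends generally do not render cell metadata in the cell body, although the note stays visible in the raw JSON and in any metadata editor, so it is hidden from casual readers rather than secret.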



##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,3517 @@
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "source": [
+        "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License."
+      ],
+      "metadata": {
+        "id": "sARMhsXz8yR1",
+        "cellView": "form"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like\n",
+        "API to declare and define Beam processing pipelines. It provides a familiar interface for machine learning practioners to build complex data-processing pipelines by only invoking standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Goal\n",
+        "The goal of this notebook is to explore a dataset preprocessed it for machine learning model training using the Beam DataFrames API.\n",
+        "\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "Because we want to explore the elements within a `PCollection`, we can use the Interactive runner by installing Apache Beam with the `interactive` component. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [

Review Comment:
   Could we file an issue to track removing this once Beam 2.43.0 is released?
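
In the meantime, a small guard could make the choice between the two install options explicit. The following is only a sketch: the helper name `has_required_dataframe_api` is hypothetical, and the 2.41 threshold is an assumption taken from the Installation cell's statement that the required DataFrames API methods are available in Beam 2.41 or later.

```python
import re

# Assumption: 2.41 is the first release that ships the DataFrame methods the
# notebook calls, as stated in the notebook's Installation cell.
MIN_BEAM_VERSION = (2, 41)


def has_required_dataframe_api(version: str) -> bool:
    """Return True if `version` (for example "2.41.0") is at least MIN_BEAM_VERSION.

    Only the leading "major.minor" components are compared, so suffixes such
    as ".dev" are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= MIN_BEAM_VERSION
```

In the notebook, this could be called with `beam.__version__` after the import cell to tell readers whether the install-from-source option is still needed.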





[GitHub] [beam] yeandy commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
yeandy commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r972269658


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. They give machine learning practitioners a familiar interface for building complex data-processing pipelines using only standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to quickly test our pipeline locally before running it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we work with only a subset of the original dataset because we're limited to the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function, which emulates `pandas.read_csv`. The main difference between them is that the Beam function returns a deferred Beam DataFrame, while the pandas function returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, sample] where\n",
+        "# sample contains all of the dataset (around 1000000 samples)\n",
+        "file_location = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize the pipeline.\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(file_location, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. A distance of 70,000 km might not seem potentially harmful, but at an astronomical scale it is very small and can disrupt many natural phenomena. \n",
+        "\n",
+        "These objects/asteroids can therefore prove to be harmful, so it is wise to know what surrounds us and which of those objects can harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types."
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "c24ff83d-3a13-47a6-c9c2-3978729fde82"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 4
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 746
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "14fa80de-2dee-4963-99d8-3e321f949ff8"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a986c6cc61ed5a5b622e163d92f73775\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "       spk_id                   full_name near_earth_object  \\\n",
+              "0     2000001                     1 Ceres                 N   \n",
+              "1     2000002                    2 Pallas                 N   \n",
+              "2     2000003                      3 Juno                 N   \n",
+              "3     2000004                     4 Vesta                 N   \n",
+              "4     2000005                   5 Astraea                 N   \n",
+              "...       ...                         ...               ...   \n",
+              "9994  2009995    9995 Alouette (4805 P-L)                 N   \n",
+              "9995  2009996         9996 ANS (9070 P-L)                 N   \n",
+              "9996  2009997        9997 COBE (1217 T-1)                 N   \n",
+              "9997  2009998         9998 ISO (1293 T-1)                 N   \n",
+              "9998  2009999       9999 Wiles (4196 T-2)                 N   \n",
+              "\n",
+              "      absolute_magnitude  diameter  albedo  diameter_sigma  eccentricity  \\\n",
+              "0                   3.40   939.400  0.0900           0.200      0.076009   \n",
+              "1                   4.20   545.000  0.1010          18.000      0.229972   \n",
+              "2                   5.33   246.596  0.2140          10.594      0.256936   \n",
+              "3                   3.00   525.400  0.4228           0.200      0.088721   \n",
+              "4                   6.90   106.699  0.2740           3.140      0.190913   \n",
+              "...                  ...       ...     ...             ...           ...   \n",
+              "9994               15.10     2.564  0.2450           0.550      0.160610   \n",
+              "9995               13.60     8.978  0.1130           0.376      0.235174   \n",
+              "9996               14.30       NaN     NaN             NaN      0.113059   \n",
+              "9997               15.10     2.235  0.3880           0.373      0.093852   \n",
+              "9998               13.00     7.148  0.2620           0.065      0.071351   \n",
+              "\n",
+              "      inclination     moid_ld object_class  semi_major_axis_au_unit  \\\n",
+              "0       10.594067  620.640533          MBA                 2.769165   \n",
+              "1       34.832932  480.348639          MBA                 2.773841   \n",
+              "2       12.991043  402.514639          MBA                 2.668285   \n",
+              "3        7.141771  443.451432          MBA                 2.361418   \n",
+              "4        5.367427  426.433027          MBA                 2.574037   \n",
+              "...           ...         ...          ...                      ...   \n",
+              "9994     2.311731  388.723233          MBA                 2.390249   \n",
+              "9995     7.657713  444.194746          MBA                 2.796605   \n",
+              "9996     2.459643  495.460110          MBA                 2.545674   \n",
+              "9997     3.912263  373.848377          MBA                 2.160961   \n",
+              "9998     3.198839  632.144398          MBA                 2.839917   \n",
+              "\n",
+              "     hazardous_flag  \n",
+              "0                 N  \n",
+              "1                 N  \n",
+              "2                 N  \n",
+              "3                 N  \n",
+              "4                 N  \n",
+              "...             ...  \n",
+              "9994              N  \n",
+              "9995              N  \n",
+              "9996              N  \n",
+              "9997              N  \n",
+              "9998              N  \n",
+              "\n",
+              "[9999 rows x 13 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-0161aa69-d50f-4d6f-84c1-10dacb278880\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>full_name</th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>2000001</td>\n",
+              "      <td>1 Ceres</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>939.400</td>\n",
+              "      <td>0.0900</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>2000002</td>\n",
+              "      <td>2 Pallas</td>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>545.000</td>\n",
+              "      <td>0.1010</td>\n",
+              "      <td>18.000</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>2000003</td>\n",
+              "      <td>3 Juno</td>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>246.596</td>\n",
+              "      <td>0.2140</td>\n",
+              "      <td>10.594</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>2000004</td>\n",
+              "      <td>4 Vesta</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>525.400</td>\n",
+              "      <td>0.4228</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>2000005</td>\n",
+              "      <td>5 Astraea</td>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>106.699</td>\n",
+              "      <td>0.2740</td>\n",
+              "      <td>3.140</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>2009995</td>\n",
+              "      <td>9995 Alouette (4805 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.564</td>\n",
+              "      <td>0.2450</td>\n",
+              "      <td>0.550</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>2009996</td>\n",
+              "      <td>9996 ANS (9070 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>8.978</td>\n",
+              "      <td>0.1130</td>\n",
+              "      <td>0.376</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>2009997</td>\n",
+              "      <td>9997 COBE (1217 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>2009998</td>\n",
+              "      <td>9998 ISO (1293 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.235</td>\n",
+              "      <td>0.3880</td>\n",
+              "      <td>0.373</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>2009999</td>\n",
+              "      <td>9999 Wiles (4196 T-2)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>7.148</td>\n",
+              "      <td>0.2620</td>\n",
+              "      <td>0.065</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 13 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-0161aa69-d50f-4d6f-84c1-10dacb278880')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 5
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
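As a rough illustration of what one-hot encoding does to a categorical column, here is a minimal sketch in plain Python. It does not use the Beam DataFrame API; the helper name `one_hot_encode` and the sample values are assumptions made for this example only.

```python
def one_hot_encode(values):
    """Return (categories, rows); each row is a list of 0/1 indicators."""
    # Sort the distinct categories so the output column order is deterministic.
    categories = sorted(set(values))
    position = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[position[value]] = 1  # exactly one indicator is set per row
        rows.append(row)
    return categories, rows


# Hypothetical sample resembling an 'object_class' column.
sample = ["MBA", "APO", "MBA", "AMO"]
categories, encoded = one_hot_encode(sample)
```

Each distinct category becomes its own 0/1 column, so a model never sees an artificial ordering between classes.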
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check if all the columns can be used for model training. Let's first have a look at the column description as provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the magnitude an asteroid would have if it were 1 au from both the Sun and the observer, at zero phase angle.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km.\n",
+        "* **albedo:** a measure of the diffuse reflection of solar radiation out of the total solar radiation, measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in object diameter, in km.\n",
+        "* **eccentricity:** value between 0 and 1 that describes how circular (close to 0) or elongated (close to 1) the asteroid's orbit is.\n",
+        "* **inclination:** angle with respect to x-y ecliptic plane\n",
+        "* **moid_ld:** Earth Minimum Orbit Intersection Distance, in lunar distances (LD).\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** the length of half of the long axis of the orbit, in au.\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. These columns can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's have a look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_02cba1067de8e024374a26297834d233\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most columns have no missing values. Columns **'diameter'**, **'albedo'** and **'diameter_sigma'** are each missing roughly 13-14% of their values. Since these values cannot be measured or derived, we can remove these columns; they will not be required for machine learning model training."
+      ]
+    },
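The missing-value check and the column removal described here can be sketched in plain Python. This is a minimal illustration rather than the notebook's Beam code: `is_missing`, `missing_percentage`, `drop_sparse_columns`, the threshold, and the sample data are all hypothetical.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing, as a DataFrame isnull() would."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_percentage(columns):
    """Map each column name to the percentage of missing entries."""
    return {
        name: 100.0 * sum(is_missing(v) for v in values) / len(values)
        for name, values in columns.items()
    }


def drop_sparse_columns(columns, threshold_pct):
    """Keep only the columns whose missing percentage is within the threshold."""
    pct = missing_percentage(columns)
    return {n: v for n, v in columns.items() if pct[n] <= threshold_pct}


# Hypothetical sample: 'diameter' is missing half of its values.
sample = {
    "diameter": [939.4, None, float("nan"), 525.4],
    "eccentricity": [0.076, 0.230, 0.257, 0.089],
}
kept = drop_sparse_columns(sample, threshold_pct=10.0)
```

The mean of a boolean "is missing" mask is the fraction of missing rows, which is why multiplying it by 100 gives a percentage.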
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "The numerical columns need to be normalized before using them to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation (the z-score). This ensures that all the columns have the same scale and are weighted equally during training."
+      ]
+    },
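For a single column, the standardization described here can be sketched with the Python standard library. This is an illustration only, not the Beam DataFrame implementation; the function name `standardize` and the sample values are assumptions. It uses the sample standard deviation (n - 1 denominator), which is also the default of pandas' `std()`.

```python
import statistics


def standardize(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(value - mean) / std for value in values]


# Hypothetical column values: mean is 4.0 and sample std is 2.0.
sample = [2.0, 4.0, 6.0]
scaled = standardize(sample)
```

After this transformation the column has mean 0 and standard deviation 1, so columns measured in very different units become comparable.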
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical columns and the categorical columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))"
+      ]
+    },
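Note that building the categorical list from a set difference does not guarantee column order. A minimal plain-Python sketch of an order-preserving split follows; it is an assumption-laden illustration (the helper `split_columns` and the sample data are hypothetical), not part of the Beam DataFrame API.

```python
from numbers import Number


def split_columns(columns):
    """Split column names into (numerical, categorical), keeping input order."""
    numerical, categorical = [], []
    for name, values in columns.items():
        # bool is a Number subclass in Python; treat it as categorical here.
        if all(isinstance(v, Number) and not isinstance(v, bool) for v in values):
            numerical.append(name)
        else:
            categorical.append(name)
    return numerical, categorical


# Hypothetical two-row sample using column names from the dataset.
sample = {
    "near_earth_object": ["N", "N"],
    "absolute_magnitude": [3.4, 4.2],
    "object_class": ["MBA", "MBA"],
    "eccentricity": [0.076, 0.230],
}
numerical_names, categorical_names = split_columns(sample)
```

With the DataFrame code in the cell above, the same effect can be had by filtering the original column list with a list comprehension instead of subtracting sets.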
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 356
+        },
+        "id": "Gjc0UlDD-xUn",
+        "outputId": "cadc4402-7edc-43f6-bc6a-bc4f1fd3314a"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "error",
+          "ename": "NotImplementedError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mNotImplementedError\u001b[0m                       Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/987840581.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Normalizing method_1: Can work but relies on ticket\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frame_base.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    426\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    427\u001b[0m     raise NotImplementedError(\n\u001b[0;32m--> 428\u001b[0;31m         \u001b[0;34mf\"{op!r} is not implemented yet. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    429\u001b[0m         \u001b[0;34mf\"If support for {op!r} is important to you, please let the Beam \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    430\u001b[0m         \u001b[0;34m\"community know by writing to user@beam.apache.org \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\n",
+            "\u001b[0;31mNotImplementedError\u001b[0m: 'loc.setitem' is not implemented yet. If support for 'loc.setitem' is important to you, please let the Beam community know by writing to user@beam.apache.org (see https://beam.apache.org/community/contact-us/) or commenting on https://github.com/apache/beam/issues/20318"
+          ]
+        }
+      ],
+      "source": [
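+        "# Normalizing method_1: Can work but relies on ticket #22267\n",
+        "# Note: this currently raises NotImplementedError because 'loc.setitem' is not implemented yet.\n",
+        "beam_df.loc[:, numerical_cols] = (beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean()) / beam_df.loc[:, numerical_cols].std()"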
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
+        "Normalizing the data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "be40308f-d27c-46fa-e365-43a635addf6b"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>306</th>\n",
+              "      <td>-1.570727</td>\n",
+              "      <td>-0.062543</td>\n",
+              "      <td>-0.278518</td>\n",
+              "      <td>0.373194</td>\n",
+              "      <td>0.357201</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>310</th>\n",
+              "      <td>-1.631718</td>\n",
+              "      <td>-1.724526</td>\n",
+              "      <td>-0.736389</td>\n",
+              "      <td>1.087833</td>\n",
+              "      <td>0.344233</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>546</th>\n",
+              "      <td>-1.753698</td>\n",
+              "      <td>1.028793</td>\n",
+              "      <td>1.415303</td>\n",
+              "      <td>-0.339489</td>\n",
+              "      <td>0.139080</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>635</th>\n",
+              "      <td>-1.875678</td>\n",
+              "      <td>0.244869</td>\n",
+              "      <td>0.005905</td>\n",
+              "      <td>0.214107</td>\n",
+              "      <td>0.367559</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>701</th>\n",
+              "      <td>-3.278451</td>\n",
+              "      <td>-1.570523</td>\n",
+              "      <td>2.006145</td>\n",
+              "      <td>1.542754</td>\n",
+              "      <td>0.829337</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9697</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-1.151809</td>\n",
+              "      <td>-0.082944</td>\n",
+              "      <td>-0.129556</td>\n",
+              "      <td>-0.533538</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9813</th>\n",
+              "      <td>1.722740</td>\n",
+              "      <td>0.844551</td>\n",
+              "      <td>-0.583247</td>\n",
+              "      <td>-1.006447</td>\n",
+              "      <td>-0.677961</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9868</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-0.207399</td>\n",
+              "      <td>-0.784665</td>\n",
+              "      <td>-0.462136</td>\n",
+              "      <td>-0.539794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9903</th>\n",
+              "      <td>0.868878</td>\n",
+              "      <td>0.460086</td>\n",
+              "      <td>0.092258</td>\n",
+              "      <td>-0.107597</td>\n",
+              "      <td>0.071794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9956</th>\n",
+              "      <td>0.746898</td>\n",
+              "      <td>-0.234132</td>\n",
+              "      <td>-0.161116</td>\n",
+              "      <td>-0.601379</td>\n",
+              "      <td>-0.664887</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 5 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_numerical_cols = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize the DataFrame containing only numerical columns\n",
+        "beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean())/beam_numerical_cols.std()\n",
+        "\n",
+        "ib.collect(beam_numerical_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to convert the categorical columns into one-hot encoded variables to use them during training.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "Ngoxg0rSywVd",
+        "outputId": "d81bb29a-f8f8-4186-b186-5ff85667dbec"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1671644751.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'object_class'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'get_dummies'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col= beam_df.filter(items=['object_class'])\n",
+        "object_class_col.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "hz8s7z8caTq-",
+        "outputId": "5c543a6b-0ea1-41f8-afec-9691bbbd1f5b"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1927971370.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;31m# df['categories_concat'].str.get_dummies('-')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'str'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col.str.get_dummies()"

Review Comment:
   [Here](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/dataframe/frames_test.py#L2417) and [here](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/dataframe/frames_test.py#L1486) are some examples of applying `CategoricalDtype`
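
A minimal sketch of the `CategoricalDtype` idea, written in plain pandas so that it runs without a pipeline. The category list `OBJECT_CLASSES` and the helper `one_hot` are hypothetical names chosen for illustration; only `MBA` is taken from the notebook output. Declaring the categories up front fixes the output columns independently of the data, which is presumably what a deferred DataFrame needs in order to know its schema before execution. The traceback above also suggests that `filter(items=[...])` yields a DataFrame rather than a Series, which would explain the missing `str` attribute.

```python
import pandas as pd

# Hypothetical category list; in practice it would come from the dataset.
OBJECT_CLASSES = ["AMO", "APO", "MBA"]
object_class_dtype = pd.CategoricalDtype(categories=OBJECT_CLASSES)


def one_hot(values):
    """One-hot encode values against the fixed category list."""
    series = pd.Series(values, name="object_class").astype(object_class_dtype)
    # Every declared category becomes a column, even if it never occurs.
    return pd.get_dummies(series)


encoded = one_hot(["MBA", "MBA", "APO"])
print(encoded)

# Assumption: on a deferred Beam Series the analogous call would be
#   beam_df.object_class.astype(object_class_dtype).str.get_dummies()
# This is not verified here; see the linked tests for the supported form.
```

Because the columns depend only on the declared categories, two partitions of the data encode to frames with identical columns, so their results can be concatenated safely.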





[GitHub] [beam] KevinGG commented on pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
KevinGG commented on PR #22587:
URL: https://github.com/apache/beam/pull/22587#issuecomment-1243009835

   @PhilippeMoussalli Could you please take a look at https://github.com/apache/beam/pull/23069?




[GitHub] [beam] PhilippeMoussalli commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
PhilippeMoussalli commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r948163222


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. This API provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by invoking only standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to quickly test our pipeline locally before running it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we will only be working with a subset of the original dataset since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam method returns a deferred Beam DataFrame, while pandas returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, sample] where\n",
+        "# sample contains all of the dataset (around 1000000 samples)\n",
+        "file_location = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(file_location, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. A distance of 70,000 km may not seem potentially harmful, but at an astronomical scale it is very small and can disrupt many natural phenomena. \n",
+        "\n",
+        "These objects (asteroids) can therefore prove harmful, so it is wise to know what surrounds us and which of these objects can harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types."
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "c24ff83d-3a13-47a6-c9c2-3978729fde82"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 4
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a Pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 746
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "14fa80de-2dee-4963-99d8-3e321f949ff8"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a986c6cc61ed5a5b622e163d92f73775\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "       spk_id                   full_name near_earth_object  \\\n",
+              "0     2000001                     1 Ceres                 N   \n",
+              "1     2000002                    2 Pallas                 N   \n",
+              "2     2000003                      3 Juno                 N   \n",
+              "3     2000004                     4 Vesta                 N   \n",
+              "4     2000005                   5 Astraea                 N   \n",
+              "...       ...                         ...               ...   \n",
+              "9994  2009995    9995 Alouette (4805 P-L)                 N   \n",
+              "9995  2009996         9996 ANS (9070 P-L)                 N   \n",
+              "9996  2009997        9997 COBE (1217 T-1)                 N   \n",
+              "9997  2009998         9998 ISO (1293 T-1)                 N   \n",
+              "9998  2009999       9999 Wiles (4196 T-2)                 N   \n",
+              "\n",
+              "      absolute_magnitude  diameter  albedo  diameter_sigma  eccentricity  \\\n",
+              "0                   3.40   939.400  0.0900           0.200      0.076009   \n",
+              "1                   4.20   545.000  0.1010          18.000      0.229972   \n",
+              "2                   5.33   246.596  0.2140          10.594      0.256936   \n",
+              "3                   3.00   525.400  0.4228           0.200      0.088721   \n",
+              "4                   6.90   106.699  0.2740           3.140      0.190913   \n",
+              "...                  ...       ...     ...             ...           ...   \n",
+              "9994               15.10     2.564  0.2450           0.550      0.160610   \n",
+              "9995               13.60     8.978  0.1130           0.376      0.235174   \n",
+              "9996               14.30       NaN     NaN             NaN      0.113059   \n",
+              "9997               15.10     2.235  0.3880           0.373      0.093852   \n",
+              "9998               13.00     7.148  0.2620           0.065      0.071351   \n",
+              "\n",
+              "      inclination     moid_ld object_class  semi_major_axis_au_unit  \\\n",
+              "0       10.594067  620.640533          MBA                 2.769165   \n",
+              "1       34.832932  480.348639          MBA                 2.773841   \n",
+              "2       12.991043  402.514639          MBA                 2.668285   \n",
+              "3        7.141771  443.451432          MBA                 2.361418   \n",
+              "4        5.367427  426.433027          MBA                 2.574037   \n",
+              "...           ...         ...          ...                      ...   \n",
+              "9994     2.311731  388.723233          MBA                 2.390249   \n",
+              "9995     7.657713  444.194746          MBA                 2.796605   \n",
+              "9996     2.459643  495.460110          MBA                 2.545674   \n",
+              "9997     3.912263  373.848377          MBA                 2.160961   \n",
+              "9998     3.198839  632.144398          MBA                 2.839917   \n",
+              "\n",
+              "     hazardous_flag  \n",
+              "0                 N  \n",
+              "1                 N  \n",
+              "2                 N  \n",
+              "3                 N  \n",
+              "4                 N  \n",
+              "...             ...  \n",
+              "9994              N  \n",
+              "9995              N  \n",
+              "9996              N  \n",
+              "9997              N  \n",
+              "9998              N  \n",
+              "\n",
+              "[9999 rows x 13 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-0161aa69-d50f-4d6f-84c1-10dacb278880\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>full_name</th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>2000001</td>\n",
+              "      <td>1 Ceres</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>939.400</td>\n",
+              "      <td>0.0900</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>2000002</td>\n",
+              "      <td>2 Pallas</td>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>545.000</td>\n",
+              "      <td>0.1010</td>\n",
+              "      <td>18.000</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>2000003</td>\n",
+              "      <td>3 Juno</td>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>246.596</td>\n",
+              "      <td>0.2140</td>\n",
+              "      <td>10.594</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>2000004</td>\n",
+              "      <td>4 Vesta</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>525.400</td>\n",
+              "      <td>0.4228</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>2000005</td>\n",
+              "      <td>5 Astraea</td>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>106.699</td>\n",
+              "      <td>0.2740</td>\n",
+              "      <td>3.140</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>2009995</td>\n",
+              "      <td>9995 Alouette (4805 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.564</td>\n",
+              "      <td>0.2450</td>\n",
+              "      <td>0.550</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>2009996</td>\n",
+              "      <td>9996 ANS (9070 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>8.978</td>\n",
+              "      <td>0.1130</td>\n",
+              "      <td>0.376</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>2009997</td>\n",
+              "      <td>9997 COBE (1217 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>2009998</td>\n",
+              "      <td>9998 ISO (1293 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.235</td>\n",
+              "      <td>0.3880</td>\n",
+              "      <td>0.373</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>2009999</td>\n",
+              "      <td>9999 Wiles (4196 T-2)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>7.148</td>\n",
+              "      <td>0.2620</td>\n",
+              "      <td>0.065</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 13 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-0161aa69-d50f-4d6f-84c1-10dacb278880')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 5
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check whether all the columns can be used for model training. Let's first have a look at the column descriptions as provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** The visual magnitude an asteroid would have if it were observed at 1 au from both the Sun and the observer, at zero phase angle.\n",
+        "* **diameter:** Object diameter (from equivalent sphere), in km.\n",
+        "* **albedo:** The fraction of incoming solar radiation that the object reflects diffusely, measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in the object diameter, in km.\n",
+        "* **eccentricity:** Value between 0 and 1 that describes how much the asteroid's orbit deviates from a circle (0 is a circular orbit).\n",
+        "* **inclination:** angle with respect to x-y ecliptic plane\n",
+        "* **moid_ld:** Earth Minimum Orbit Intersection Distance, in lunar distance (LD) units.\n",
+        "* **object_class:** The classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** The length of half of the long axis of the orbit, in au.\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. These columns can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's have a look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_02cba1067de8e024374a26297834d233\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most columns have no missing values. Columns **'diameter'**, **'albedo'** and **'diameter_sigma'** each have roughly 13-14% missing values. Since these values cannot be measured or derived, we remove these columns; they are not required for training the machine learning model."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "The numerical columns need to be normalized before they are used to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation (the z-score). This puts all numerical columns on the same scale, so that they are weighted comparably during training."
+      ]
+    },
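+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "As a minimal sketch of this standardization, here is the same formula applied in plain pandas to a hypothetical toy DataFrame. The name `toy` and its values are assumptions made up for illustration; they are not rows from the dataset."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "import pandas as pd\n",
+        "\n",
+        "# Hypothetical toy data, for illustration only\n",
+        "toy = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 60.0]})\n",
+        "\n",
+        "# z-score: subtract the column mean, then divide by the column (sample) standard deviation\n",
+        "toy_standardized = (toy - toy.mean()) / toy.std()\n",
+        "\n",
+        "# After standardization each column should have mean close to 0 and standard deviation close to 1\n",
+        "print(toy_standardized.mean())\n",
+        "print(toy_standardized.std())"
+      ]
+    },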
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical columns and the categorical columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 356
+        },
+        "id": "Gjc0UlDD-xUn",
+        "outputId": "cadc4402-7edc-43f6-bc6a-bc4f1fd3314a"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "error",
+          "ename": "NotImplementedError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mNotImplementedError\u001b[0m                       Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/987840581.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Normalizing method_1: Can work but relies on ticket\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001
 b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frame_base.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    426\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    427\u001b[0m     raise NotImplementedError(\n\u001b[0;32m--> 428\u001b[0;31m         \u001b[0;34mf\"{op!r} is not implemented yet. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    429\u001b[0m         \u001b[0;34mf\"If support for {op!r} is important to you, please let the Beam \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    430\u001b[0m         \u001b[0;34m\"community know by writing to user@beam.apache.org \"\u001b[0m\u001b[0;34m\u
 001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mNotImplementedError\u001b[0m: 'loc.setitem' is not implemented yet. If support for 'loc.setitem' is important to you, please let the Beam community know by writing to user@beam.apache.org (see https://beam.apache.org/community/contact-us/) or commenting on https://github.com/apache/beam/issues/20318"
+          ]
+        }
+      ],
+      "source": [
+        "# Normalizing method_1: could work, but relies on ticket #22267; it currently fails because 'loc.setitem' is not implemented yet\n",
+        "beam_df.loc[:, numerical_cols] = (beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean()) / beam_df.loc[:, numerical_cols].std()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
+        "Normalizing the data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "be40308f-d27c-46fa-e365-43a635addf6b"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>306</th>\n",
+              "      <td>-1.570727</td>\n",
+              "      <td>-0.062543</td>\n",
+              "      <td>-0.278518</td>\n",
+              "      <td>0.373194</td>\n",
+              "      <td>0.357201</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>310</th>\n",
+              "      <td>-1.631718</td>\n",
+              "      <td>-1.724526</td>\n",
+              "      <td>-0.736389</td>\n",
+              "      <td>1.087833</td>\n",
+              "      <td>0.344233</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>546</th>\n",
+              "      <td>-1.753698</td>\n",
+              "      <td>1.028793</td>\n",
+              "      <td>1.415303</td>\n",
+              "      <td>-0.339489</td>\n",
+              "      <td>0.139080</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>635</th>\n",
+              "      <td>-1.875678</td>\n",
+              "      <td>0.244869</td>\n",
+              "      <td>0.005905</td>\n",
+              "      <td>0.214107</td>\n",
+              "      <td>0.367559</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>701</th>\n",
+              "      <td>-3.278451</td>\n",
+              "      <td>-1.570523</td>\n",
+              "      <td>2.006145</td>\n",
+              "      <td>1.542754</td>\n",
+              "      <td>0.829337</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9697</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-1.151809</td>\n",
+              "      <td>-0.082944</td>\n",
+              "      <td>-0.129556</td>\n",
+              "      <td>-0.533538</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9813</th>\n",
+              "      <td>1.722740</td>\n",
+              "      <td>0.844551</td>\n",
+              "      <td>-0.583247</td>\n",
+              "      <td>-1.006447</td>\n",
+              "      <td>-0.677961</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9868</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-0.207399</td>\n",
+              "      <td>-0.784665</td>\n",
+              "      <td>-0.462136</td>\n",
+              "      <td>-0.539794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9903</th>\n",
+              "      <td>0.868878</td>\n",
+              "      <td>0.460086</td>\n",
+              "      <td>0.092258</td>\n",
+              "      <td>-0.107597</td>\n",
+              "      <td>0.071794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9956</th>\n",
+              "      <td>0.746898</td>\n",
+              "      <td>-0.234132</td>\n",
+              "      <td>-0.161116</td>\n",
+              "      <td>-0.601379</td>\n",
+              "      <td>-0.664887</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 5 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_numerical_cols = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize the DataFrame that contains only the numerical columns\n",
+        "beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean())/beam_numerical_cols.std()\n",
+        "\n",
+        "ib.collect(beam_numerical_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to convert the categorical columns into one-hot encoded variables so that they can be used during training.\n"
+      ]
+    },
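+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "For reference, a minimal sketch of one-hot encoding in plain pandas, using the top-level `pd.get_dummies` function on a hypothetical toy column (the name `toy` and its values are made up for illustration). It only shows the shape of the desired result and does not use the Beam DataFrame API; one-hot encoding with the Beam DataFrame API is still an open item in this notebook (related to ticket #22268)."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "import pandas as pd\n",
+        "\n",
+        "# Hypothetical toy data, for illustration only\n",
+        "toy = pd.DataFrame({'object_class': ['MBA', 'AMO', 'MBA']})\n",
+        "\n",
+        "# Produces one indicator column per distinct category value\n",
+        "pd.get_dummies(toy, columns=['object_class'])"
+      ]
+    },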
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "Ngoxg0rSywVd",
+        "outputId": "d81bb29a-f8f8-4186-b186-5ff85667dbec"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1671644751.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'object_class'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'get_dummies'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col = beam_df.filter(items=['object_class'])\n",
+        "object_class_col.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "hz8s7z8caTq-",
+        "outputId": "5c543a6b-0ea1-41f8-afec-9691bbbd1f5b"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1927971370.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;31m# df['categories_concat'].str.get_dummies('-')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'str'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col.str.get_dummies()"

Review Comment:
   I tried this approach, and I am still not able to run the `unique()` command, even when passing the argument as `unique(as_series=True)`, as specified [here](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.frames.html#:~:text=unique(as_series%3DFalse,%5Bsource%5D)
   (Error: `unique()` is not implemented for deferred Dataframes).
   In any case, I think it would be nicer to wait for `get_dummies()`, as it will be more intuitive for users. I see that you already filed a [ticket](https://github.com/apache/beam/issues/22646) for it and referred to it in the deliverable 3 doc.
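
Until `get_dummies()` is available, one possible workaround is to one-hot encode against a category list that is fixed up front, so that neither `unique()` nor `get_dummies()` is needed. The sketch below is plain pandas only. The helper name `one_hot_known_categories` and the class labels `AMO` and `APO` are made up for illustration (only `MBA` appears in the notebook output), and whether the same elementwise comparison and column assignment work on a deferred Beam DataFrame is an assumption that has not been verified here.

```python
import pandas as pd


def one_hot_known_categories(df, column, categories):
    """One-hot encode `column` using a category list known in advance.

    Hypothetical helper: it avoids discovering categories from the data,
    which is the step that requires `unique()` or `get_dummies()`.
    """
    encoded = pd.DataFrame(index=df.index)
    for category in categories:
        # Elementwise comparison yields a boolean Series; cast it to 0/1.
        encoded[f"{column}_{category}"] = (df[column] == category).astype("int64")
    return encoded


# Toy data; 'AMO' and 'APO' are illustrative labels, not taken from the dataset.
df = pd.DataFrame({"object_class": ["MBA", "APO", "MBA", "AMO"]})
encoded = one_hot_known_categories(df, "object_class", ["AMO", "APO", "MBA"])
print(encoded)
```

Because the output columns are determined by the `categories` argument rather than by the data, the schema is known before any rows are read, which is the property a deferred DataFrame would likely need.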
   





[GitHub] [beam] PhilippeMoussalli commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
PhilippeMoussalli commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r948163222


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. It provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by invoking only standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to quickly test our pipeline locally before running it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we work with only a subset of the original dataset, since we're limited to the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam function returns a deferred Beam DataFrame, while the pandas function returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, sample] where\n",
+        "# sample contains all of the dataset (around 1000000 samples)\n",
+        "file_location = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize the pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(file_location, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. A distance of 70,000 km might seem too large to pose any danger, but on an astronomical scale it is very small, and an object that close can disrupt many natural phenomena.\n",
+        "\n",
+        "These objects/asteroids can thus prove to be harmful, so it is wise to know what surrounds us and which of these objects could harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types."
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "c24ff83d-3a13-47a6-c9c2-3978729fde82"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 4
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 746
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "14fa80de-2dee-4963-99d8-3e321f949ff8"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a986c6cc61ed5a5b622e163d92f73775\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "       spk_id                   full_name near_earth_object  \\\n",
+              "0     2000001                     1 Ceres                 N   \n",
+              "1     2000002                    2 Pallas                 N   \n",
+              "2     2000003                      3 Juno                 N   \n",
+              "3     2000004                     4 Vesta                 N   \n",
+              "4     2000005                   5 Astraea                 N   \n",
+              "...       ...                         ...               ...   \n",
+              "9994  2009995    9995 Alouette (4805 P-L)                 N   \n",
+              "9995  2009996         9996 ANS (9070 P-L)                 N   \n",
+              "9996  2009997        9997 COBE (1217 T-1)                 N   \n",
+              "9997  2009998         9998 ISO (1293 T-1)                 N   \n",
+              "9998  2009999       9999 Wiles (4196 T-2)                 N   \n",
+              "\n",
+              "      absolute_magnitude  diameter  albedo  diameter_sigma  eccentricity  \\\n",
+              "0                   3.40   939.400  0.0900           0.200      0.076009   \n",
+              "1                   4.20   545.000  0.1010          18.000      0.229972   \n",
+              "2                   5.33   246.596  0.2140          10.594      0.256936   \n",
+              "3                   3.00   525.400  0.4228           0.200      0.088721   \n",
+              "4                   6.90   106.699  0.2740           3.140      0.190913   \n",
+              "...                  ...       ...     ...             ...           ...   \n",
+              "9994               15.10     2.564  0.2450           0.550      0.160610   \n",
+              "9995               13.60     8.978  0.1130           0.376      0.235174   \n",
+              "9996               14.30       NaN     NaN             NaN      0.113059   \n",
+              "9997               15.10     2.235  0.3880           0.373      0.093852   \n",
+              "9998               13.00     7.148  0.2620           0.065      0.071351   \n",
+              "\n",
+              "      inclination     moid_ld object_class  semi_major_axis_au_unit  \\\n",
+              "0       10.594067  620.640533          MBA                 2.769165   \n",
+              "1       34.832932  480.348639          MBA                 2.773841   \n",
+              "2       12.991043  402.514639          MBA                 2.668285   \n",
+              "3        7.141771  443.451432          MBA                 2.361418   \n",
+              "4        5.367427  426.433027          MBA                 2.574037   \n",
+              "...           ...         ...          ...                      ...   \n",
+              "9994     2.311731  388.723233          MBA                 2.390249   \n",
+              "9995     7.657713  444.194746          MBA                 2.796605   \n",
+              "9996     2.459643  495.460110          MBA                 2.545674   \n",
+              "9997     3.912263  373.848377          MBA                 2.160961   \n",
+              "9998     3.198839  632.144398          MBA                 2.839917   \n",
+              "\n",
+              "     hazardous_flag  \n",
+              "0                 N  \n",
+              "1                 N  \n",
+              "2                 N  \n",
+              "3                 N  \n",
+              "4                 N  \n",
+              "...             ...  \n",
+              "9994              N  \n",
+              "9995              N  \n",
+              "9996              N  \n",
+              "9997              N  \n",
+              "9998              N  \n",
+              "\n",
+              "[9999 rows x 13 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-0161aa69-d50f-4d6f-84c1-10dacb278880\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>full_name</th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>2000001</td>\n",
+              "      <td>1 Ceres</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>939.400</td>\n",
+              "      <td>0.0900</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>2000002</td>\n",
+              "      <td>2 Pallas</td>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>545.000</td>\n",
+              "      <td>0.1010</td>\n",
+              "      <td>18.000</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>2000003</td>\n",
+              "      <td>3 Juno</td>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>246.596</td>\n",
+              "      <td>0.2140</td>\n",
+              "      <td>10.594</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>2000004</td>\n",
+              "      <td>4 Vesta</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>525.400</td>\n",
+              "      <td>0.4228</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>2000005</td>\n",
+              "      <td>5 Astraea</td>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>106.699</td>\n",
+              "      <td>0.2740</td>\n",
+              "      <td>3.140</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>2009995</td>\n",
+              "      <td>9995 Alouette (4805 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.564</td>\n",
+              "      <td>0.2450</td>\n",
+              "      <td>0.550</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>2009996</td>\n",
+              "      <td>9996 ANS (9070 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>8.978</td>\n",
+              "      <td>0.1130</td>\n",
+              "      <td>0.376</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>2009997</td>\n",
+              "      <td>9997 COBE (1217 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>2009998</td>\n",
+              "      <td>9998 ISO (1293 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.235</td>\n",
+              "      <td>0.3880</td>\n",
+              "      <td>0.373</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>2009999</td>\n",
+              "      <td>9999 Wiles (4196 T-2)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>7.148</td>\n",
+              "      <td>0.2620</td>\n",
+              "      <td>0.065</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 13 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-0161aa69-d50f-4d6f-84c1-10dacb278880')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 5
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check if all the columns can be used for model training. Let's first have a look at the column descriptions as provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the apparent magnitude an object would have if it were located 1 au from both the Sun and the observer, at zero phase angle.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km.\n",
+        "* **albedo:** a measure of the diffuse reflection of solar radiation out of the total solar radiation and measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in object diameter, in km.\n",
+        "* **eccentricity:** a value between 0 and 1 that describes how elongated or circular the asteroid's orbit is.\n",
+        "* **inclination:** angle with respect to x-y ecliptic plane\n",
+        "* **moid_ld:** Earth Minimum Orbit Intersection Distance, in lunar distances (LD).\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** the semi-major axis of the orbit (half the length of its long axis), in au.\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. These columns can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's have a look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_02cba1067de8e024374a26297834d233\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most columns have no missing values. Columns **'diameter'**, **'albedo'** and **'diameter_sigma'** each have roughly 13-14% missing values. Since these values cannot be measured or derived, we can remove these columns; they will not be used for machine learning model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "The numerical columns need to be normalized before using them to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation. This ensures that all the columns have the same scale and are weighted equally during training."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical columns and the categorical columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 356
+        },
+        "id": "Gjc0UlDD-xUn",
+        "outputId": "cadc4402-7edc-43f6-bc6a-bc4f1fd3314a"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "error",
+          "ename": "NotImplementedError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mNotImplementedError\u001b[0m                       Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/987840581.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Normalizing method_1: Can work but relies on ticket\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frame_base.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    426\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    427\u001b[0m     raise NotImplementedError(\n\u001b[0;32m--> 428\u001b[0;31m         \u001b[0;34mf\"{op!r} is not implemented yet. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    429\u001b[0m         \u001b[0;34mf\"If support for {op!r} is important to you, please let the Beam \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    430\u001b[0m         \u001b[0;34m\"community know by writing to user@beam.apache.org \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mNotImplementedError\u001b[0m: 'loc.setitem' is not implemented yet. If support for 'loc.setitem' is important to you, please let the Beam community know by writing to user@beam.apache.org (see https://beam.apache.org/community/contact-us/) or commenting on https://github.com/apache/beam/issues/20318"
+          ]
+        }
+      ],
+      "source": [
+        "# Normalizing method_1: Can work but relies on ticket #22267\n",
+        "beam_df.loc[:,numerical_cols] = (beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean())/beam_df.loc[:, numerical_cols].std()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
+        "Normalizing the data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "be40308f-d27c-46fa-e365-43a635addf6b"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>306</th>\n",
+              "      <td>-1.570727</td>\n",
+              "      <td>-0.062543</td>\n",
+              "      <td>-0.278518</td>\n",
+              "      <td>0.373194</td>\n",
+              "      <td>0.357201</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>310</th>\n",
+              "      <td>-1.631718</td>\n",
+              "      <td>-1.724526</td>\n",
+              "      <td>-0.736389</td>\n",
+              "      <td>1.087833</td>\n",
+              "      <td>0.344233</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>546</th>\n",
+              "      <td>-1.753698</td>\n",
+              "      <td>1.028793</td>\n",
+              "      <td>1.415303</td>\n",
+              "      <td>-0.339489</td>\n",
+              "      <td>0.139080</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>635</th>\n",
+              "      <td>-1.875678</td>\n",
+              "      <td>0.244869</td>\n",
+              "      <td>0.005905</td>\n",
+              "      <td>0.214107</td>\n",
+              "      <td>0.367559</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>701</th>\n",
+              "      <td>-3.278451</td>\n",
+              "      <td>-1.570523</td>\n",
+              "      <td>2.006145</td>\n",
+              "      <td>1.542754</td>\n",
+              "      <td>0.829337</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9697</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-1.151809</td>\n",
+              "      <td>-0.082944</td>\n",
+              "      <td>-0.129556</td>\n",
+              "      <td>-0.533538</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9813</th>\n",
+              "      <td>1.722740</td>\n",
+              "      <td>0.844551</td>\n",
+              "      <td>-0.583247</td>\n",
+              "      <td>-1.006447</td>\n",
+              "      <td>-0.677961</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9868</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-0.207399</td>\n",
+              "      <td>-0.784665</td>\n",
+              "      <td>-0.462136</td>\n",
+              "      <td>-0.539794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9903</th>\n",
+              "      <td>0.868878</td>\n",
+              "      <td>0.460086</td>\n",
+              "      <td>0.092258</td>\n",
+              "      <td>-0.107597</td>\n",
+              "      <td>0.071794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9956</th>\n",
+              "      <td>0.746898</td>\n",
+              "      <td>-0.234132</td>\n",
+              "      <td>-0.161116</td>\n",
+              "      <td>-0.601379</td>\n",
+              "      <td>-0.664887</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 5 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_numerical_cols = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize dataframes only with numerical columns\n",
+        "beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean())/beam_numerical_cols.std()\n",
+        "\n",
+        "ib.collect(beam_numerical_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to convert the categorical columns into one-hot encoding variables to use them during training. \n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "Ngoxg0rSywVd",
+        "outputId": "d81bb29a-f8f8-4186-b186-5ff85667dbec"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1671644751.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'object_class'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'get_dummies'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col= beam_df.filter(items=['object_class'])\n",
+        "object_class_col.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "hz8s7z8caTq-",
+        "outputId": "5c543a6b-0ea1-41f8-afec-9691bbbd1f5b"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1927971370.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;31m# df['categories_concat'].str.get_dummies('-')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'str'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col.str.get_dummies()"

Review Comment:
   I tried this approach and I am still not able to run the `unique()` command, even when passing the argument as `unique(as_series=True)` as specified [here](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.frames.html#:~:text=b%0A2%20%20%20%20c-,unique,-(as_series%3DFalse)
   
   **Error**: `unique()` is not implemented for deferred DataFrames.
   In any case, I think it would be better to wait for `get_dummies()`, as it will be more intuitive for users. I see that you already filed a [ticket](https://github.com/apache/beam/issues/22646) for it and referred to it in the deliverable 3 doc. 
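
For reference, here is a minimal plain-Python sketch of what the one-hot step is expected to produce, assuming the list of categories is known up front. The class labels are illustrative only, and `one_hot` is a hypothetical helper, not a Beam or pandas function.

```python
def one_hot(values, categories):
    """Return one row of 0/1 indicators per value, one column per category."""
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        # Values outside the declared categories stay all-zero.
        if value in index:
            row[index[value]] = 1
        rows.append(row)
    return rows


# Illustrative class labels only; not taken from the dataset.
categories = ["AMO", "APO", "MBA"]
encoded = one_hot(["MBA", "AMO", "MBA"], categories)
print(encoded)
```

Fixing the categories in advance means the output columns do not depend on the data, which is the part that is hard to express on a deferred DataFrame.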
   





[GitHub] [beam] PhilippeMoussalli commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
PhilippeMoussalli commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r1005728017


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. They provide a familiar interface for machine learning practitioners to build complex data-processing pipelines by only invoking standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to quickly test our pipeline locally before running it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we will only be working with a subset of the original dataset since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam method returns a deferred Beam DataFrame while pandas returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, sample] where\n",
+        "# sample contains all of the dataset (around 1000000 samples)\n",
+        "file_location = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(file_location, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. A distance of 70,000 km might not seem potentially harmful, but on an astronomical scale it is very small and can disrupt many natural phenomena. \n",
+        "\n",
+        "These objects/asteroids can thus prove to be harmful, so it is wise to know what surrounds us and which of these objects can harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types"
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "c24ff83d-3a13-47a6-c9c2-3978729fde82"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 4
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a Pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 746
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "14fa80de-2dee-4963-99d8-3e321f949ff8"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a986c6cc61ed5a5b622e163d92f73775\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "       spk_id                   full_name near_earth_object  \\\n",
+              "0     2000001                     1 Ceres                 N   \n",
+              "1     2000002                    2 Pallas                 N   \n",
+              "2     2000003                      3 Juno                 N   \n",
+              "3     2000004                     4 Vesta                 N   \n",
+              "4     2000005                   5 Astraea                 N   \n",
+              "...       ...                         ...               ...   \n",
+              "9994  2009995    9995 Alouette (4805 P-L)                 N   \n",
+              "9995  2009996         9996 ANS (9070 P-L)                 N   \n",
+              "9996  2009997        9997 COBE (1217 T-1)                 N   \n",
+              "9997  2009998         9998 ISO (1293 T-1)                 N   \n",
+              "9998  2009999       9999 Wiles (4196 T-2)                 N   \n",
+              "\n",
+              "      absolute_magnitude  diameter  albedo  diameter_sigma  eccentricity  \\\n",
+              "0                   3.40   939.400  0.0900           0.200      0.076009   \n",
+              "1                   4.20   545.000  0.1010          18.000      0.229972   \n",
+              "2                   5.33   246.596  0.2140          10.594      0.256936   \n",
+              "3                   3.00   525.400  0.4228           0.200      0.088721   \n",
+              "4                   6.90   106.699  0.2740           3.140      0.190913   \n",
+              "...                  ...       ...     ...             ...           ...   \n",
+              "9994               15.10     2.564  0.2450           0.550      0.160610   \n",
+              "9995               13.60     8.978  0.1130           0.376      0.235174   \n",
+              "9996               14.30       NaN     NaN             NaN      0.113059   \n",
+              "9997               15.10     2.235  0.3880           0.373      0.093852   \n",
+              "9998               13.00     7.148  0.2620           0.065      0.071351   \n",
+              "\n",
+              "      inclination     moid_ld object_class  semi_major_axis_au_unit  \\\n",
+              "0       10.594067  620.640533          MBA                 2.769165   \n",
+              "1       34.832932  480.348639          MBA                 2.773841   \n",
+              "2       12.991043  402.514639          MBA                 2.668285   \n",
+              "3        7.141771  443.451432          MBA                 2.361418   \n",
+              "4        5.367427  426.433027          MBA                 2.574037   \n",
+              "...           ...         ...          ...                      ...   \n",
+              "9994     2.311731  388.723233          MBA                 2.390249   \n",
+              "9995     7.657713  444.194746          MBA                 2.796605   \n",
+              "9996     2.459643  495.460110          MBA                 2.545674   \n",
+              "9997     3.912263  373.848377          MBA                 2.160961   \n",
+              "9998     3.198839  632.144398          MBA                 2.839917   \n",
+              "\n",
+              "     hazardous_flag  \n",
+              "0                 N  \n",
+              "1                 N  \n",
+              "2                 N  \n",
+              "3                 N  \n",
+              "4                 N  \n",
+              "...             ...  \n",
+              "9994              N  \n",
+              "9995              N  \n",
+              "9996              N  \n",
+              "9997              N  \n",
+              "9998              N  \n",
+              "\n",
+              "[9999 rows x 13 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-0161aa69-d50f-4d6f-84c1-10dacb278880\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>full_name</th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>2000001</td>\n",
+              "      <td>1 Ceres</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>939.400</td>\n",
+              "      <td>0.0900</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>2000002</td>\n",
+              "      <td>2 Pallas</td>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>545.000</td>\n",
+              "      <td>0.1010</td>\n",
+              "      <td>18.000</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>2000003</td>\n",
+              "      <td>3 Juno</td>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>246.596</td>\n",
+              "      <td>0.2140</td>\n",
+              "      <td>10.594</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>2000004</td>\n",
+              "      <td>4 Vesta</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>525.400</td>\n",
+              "      <td>0.4228</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>2000005</td>\n",
+              "      <td>5 Astraea</td>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>106.699</td>\n",
+              "      <td>0.2740</td>\n",
+              "      <td>3.140</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>2009995</td>\n",
+              "      <td>9995 Alouette (4805 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.564</td>\n",
+              "      <td>0.2450</td>\n",
+              "      <td>0.550</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>2009996</td>\n",
+              "      <td>9996 ANS (9070 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>8.978</td>\n",
+              "      <td>0.1130</td>\n",
+              "      <td>0.376</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>2009997</td>\n",
+              "      <td>9997 COBE (1217 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>2009998</td>\n",
+              "      <td>9998 ISO (1293 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.235</td>\n",
+              "      <td>0.3880</td>\n",
+              "      <td>0.373</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>2009999</td>\n",
+              "      <td>9999 Wiles (4196 T-2)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>7.148</td>\n",
+              "      <td>0.2620</td>\n",
+              "      <td>0.065</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 13 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-0161aa69-d50f-4d6f-84c1-10dacb278880')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 5
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check whether all of the columns can be used for model training. Let's first look at the column descriptions provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the visual magnitude an asteroid would have if it were located 1 au from both the Sun and the observer, at zero phase angle\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km\n",
+        "* **albedo:** the fraction of incoming solar radiation that the object reflects diffusely, measured on a scale from 0 to 1\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in object diameter, in km\n",
+        "* **eccentricity:** value between 0 and 1 that describes how far the asteroid's orbit deviates from a perfect circle\n",
+        "* **inclination:** angle of the orbit with respect to the x-y ecliptic plane\n",
+        "* **moid_ld:** Earth Minimum Orbit Intersection Distance, in lunar distances (LD)\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** the length of half of the long axis of the orbit, in au\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. These columns can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's have a look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_02cba1067de8e024374a26297834d233\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most columns have no missing values. Columns **'diameter'**, **'albedo'** and **'diameter_sigma'** are each missing roughly 13-14% of their values. Since the missing values cannot be measured or derived, we can remove these columns; they will not be required for machine learning model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "The numerical columns need to be normalized before they are used to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation. This puts all columns on the same scale so that they are weighted equally during training."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical columns and the categorical columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 356
+        },
+        "id": "Gjc0UlDD-xUn",
+        "outputId": "cadc4402-7edc-43f6-bc6a-bc4f1fd3314a"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "error",
+          "ename": "NotImplementedError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mNotImplementedError\u001b[0m                       Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/987840581.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Normalizing method_1: Can work but relies on ticket\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frame_base.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    426\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    427\u001b[0m     raise NotImplementedError(\n\u001b[0;32m--> 428\u001b[0;31m         \u001b[0;34mf\"{op!r} is not implemented yet. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    429\u001b[0m         \u001b[0;34mf\"If support for {op!r} is important to you, please let the Beam \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    430\u001b[0m         \u001b[0;34m\"community know by writing to user@beam.apache.org \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mNotImplementedError\u001b[0m: 'loc.setitem' is not implemented yet. If support for 'loc.setitem' is important to you, please let the Beam community know by writing to user@beam.apache.org (see https://beam.apache.org/community/contact-us/) or commenting on https://github.com/apache/beam/issues/20318"
+          ]
+        }
+      ],
+      "source": [
+        "# Normalization method 1: would work, but relies on ticket #22267 and on 'loc.setitem', which is not implemented yet\n",
+        "beam_df.loc[:, numerical_cols] = (beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean()) / beam_df.loc[:, numerical_cols].std()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
+        "Instead, normalize the data by selecting the numerical columns and standardizing them separately."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "be40308f-d27c-46fa-e365-43a635addf6b"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>306</th>\n",
+              "      <td>-1.570727</td>\n",
+              "      <td>-0.062543</td>\n",
+              "      <td>-0.278518</td>\n",
+              "      <td>0.373194</td>\n",
+              "      <td>0.357201</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>310</th>\n",
+              "      <td>-1.631718</td>\n",
+              "      <td>-1.724526</td>\n",
+              "      <td>-0.736389</td>\n",
+              "      <td>1.087833</td>\n",
+              "      <td>0.344233</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>546</th>\n",
+              "      <td>-1.753698</td>\n",
+              "      <td>1.028793</td>\n",
+              "      <td>1.415303</td>\n",
+              "      <td>-0.339489</td>\n",
+              "      <td>0.139080</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>635</th>\n",
+              "      <td>-1.875678</td>\n",
+              "      <td>0.244869</td>\n",
+              "      <td>0.005905</td>\n",
+              "      <td>0.214107</td>\n",
+              "      <td>0.367559</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>701</th>\n",
+              "      <td>-3.278451</td>\n",
+              "      <td>-1.570523</td>\n",
+              "      <td>2.006145</td>\n",
+              "      <td>1.542754</td>\n",
+              "      <td>0.829337</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9697</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-1.151809</td>\n",
+              "      <td>-0.082944</td>\n",
+              "      <td>-0.129556</td>\n",
+              "      <td>-0.533538</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9813</th>\n",
+              "      <td>1.722740</td>\n",
+              "      <td>0.844551</td>\n",
+              "      <td>-0.583247</td>\n",
+              "      <td>-1.006447</td>\n",
+              "      <td>-0.677961</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9868</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-0.207399</td>\n",
+              "      <td>-0.784665</td>\n",
+              "      <td>-0.462136</td>\n",
+              "      <td>-0.539794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9903</th>\n",
+              "      <td>0.868878</td>\n",
+              "      <td>0.460086</td>\n",
+              "      <td>0.092258</td>\n",
+              "      <td>-0.107597</td>\n",
+              "      <td>0.071794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9956</th>\n",
+              "      <td>0.746898</td>\n",
+              "      <td>-0.234132</td>\n",
+              "      <td>-0.161116</td>\n",
+              "      <td>-0.601379</td>\n",
+              "      <td>-0.664887</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 5 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_numerical_cols = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize the DataFrame that contains only the numerical columns\n",
+        "beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean())/beam_numerical_cols.std()\n",
+        "\n",
+        "ib.collect(beam_numerical_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to convert the categorical columns into one-hot encoded variables to use them during training.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "Ngoxg0rSywVd",
+        "outputId": "d81bb29a-f8f8-4186-b186-5ff85667dbec"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1671644751.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'object_class'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'get_dummies'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col= beam_df.filter(items=['object_class'])\n",
+        "object_class_col.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "hz8s7z8caTq-",
+        "outputId": "5c543a6b-0ea1-41f8-afec-9691bbbd1f5b"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1927971370.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;31m# df['categories_concat'].str.get_dummies('-')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'str'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col.str.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "rVdSIyCB0spw"
+      },
+      "source": [
+        "# Putting it all together\n",
+        "\n",
+        "Let's now try to summarize all the steps that we've executed above into a full pipeline implementation and visualize our pre-processed data.\n",
+        "\n",
+        "> ℹ️ Note that the only standard Beam construct invoked here is the `Pipeline` instance. The rest of the pre-processing commands are all native pandas methods that have been integrated with the Beam DataFrame API."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "ndaSNond0v8Q",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 498
+        },
+        "outputId": "0155d359-45c9-4345-e1b6-b1881408f049"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "error",
+          "ename": "ValueError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1408061827.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m     25\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     26\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 27\u001b[0;31m \u001b[0mib\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcollect\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpreprocessed_dataset\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/utils.py\u001b[0m in \u001b[0;36mrun_within_progress_indicator\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    275\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mrun_within_progress_indicator\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    276\u001b[0m     \u001b[0;32mwith\u001b[0m \u001b[0mProgressIndicator\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34mf'Processing... {func.__name__}'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Done.'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 277\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001
 b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    278\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    279\u001b[0m   \u001b[0;32mreturn\u001b[0m \u001b[0mrun_within_progress_indicator\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/interactive_beam.py\u001b[0m in \u001b[0;36mcollect\u001b[0;34m(pcoll, n, duration, include_window_info)\u001b[0m\n\u001b[1;32m    945\u001b[0m         element_type=element_type)\n\u001b[1;32m    946\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 947\u001b[0;31m   \u001b[0mrecording\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mrecording_manager\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrecord\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mpcoll\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmax_n\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mn\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmax_duration\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mduration\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    948\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    949\u001b[0m   \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0
 ;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/recording_manager.py\u001b[0m in \u001b[0;36mrecord\u001b[0;34m(self, pcolls, max_n, max_duration)\u001b[0m\n\u001b[1;32m    459\u001b[0m       pf.PipelineFragment(\n\u001b[1;32m    460\u001b[0m           \u001b[0mlist\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0muncomputed_pcolls\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 461\u001b[0;31m           self.user_pipeline.options).run(blocking=is_remote_run)\n\u001b[0m\u001b[1;32m    462\u001b[0m       \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mie\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcurrent_env\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpipeline_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0muser_pipeline\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n
 \u001b[1;32m    463\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py\u001b[0m in \u001b[0;36mrun\u001b[0;34m(self, display_pipeline_graph, use_cache, blocking)\u001b[0m\n\u001b[1;32m    111\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_force_compute\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0muse_cache\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    112\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_blocking\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mblocking\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 113\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdeduce_fragment\u001
 b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    114\u001b[0m     \u001b[0;32mfinally\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    115\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_skip_display\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpreserved_skip_display\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py\u001b[0m in \u001b[0;36mdeduce_fragment\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m     98\u001b[0m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mto_runner_api\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     99\u001b[0m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 100\u001b[0;31m         self._options)\n\u001b[0m\u001b[1;32m    101\u001b[0m     \u001b[0mie\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcurrent_env\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0madd_derived_pipeline\u001b[0m\u001b[0;34m(\u001b[0m\u001b[
 0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfragment\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    102\u001b[0m     \u001b[0;32mreturn\u001b[0m \u001b[0mfragment\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/pipeline.py\u001b[0m in \u001b[0;36mfrom_runner_api\u001b[0;34m(proto, runner, options, return_context)\u001b[0m\n\u001b[1;32m    990\u001b[0m       \u001b[0mpcollection\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpipeline\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mp\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    991\u001b[0m       \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mpcollection\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mproducer\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 992\u001b[0;31m         \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'No producer for %s'\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0mid\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    993\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    994\u001b[0m 
     \u001b[0;31m# Inject PBegin input where necessary.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mValueError\u001b[0m: No producer for ref_PCollection_PCollection_265"
+          ]
+        }
+      ],
+      "source": [
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv('/content/drive/MyDrive/apache beam/dataset/nasa/sample_10000.csv', splittable=True)\n",

Review Comment:
   > Thanks! This looks great! I love the asteroids data used for this sample.
   > 
   > It would be nice to include a small paragraph saying what we're trying to do, like "Creating a preprocessed dataset for a machine learning model in Beam DataFrames". I wasn't sure what the goal of the notebook was until almost at the end. It would also be nice to mention it only creates the dataset, not the model.
   > 
   > I also noticed that the Keras preprocessing layers already take care of normalizing and one-hot-encoding data. I tend to recommend this in my own samples because it allows us to pass the model the raw data like we see it in the field, and let the model itself normalize it as part of its architecture. And adding more data to the dataset does not require normalizing the entire thing again, it could be simply adding more files. However, I think it's nice to see how to do that with Beam DataFrames as well. I think it's particularly important to explore and understand the data.
   > 
   > Other than that, I really liked the way you presented and explored the data in a very interactive way.
   
   Thanks for the feedback @davidcavazos :). Indeed, the preprocessing steps can be done with Keras. I think the benefit of showing how to do it with pandas is that it leaves the option open to train with other frameworks (sklearn, XGBoost, ...).
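
To make the two transformations under discussion concrete, here is a minimal, framework-agnostic sketch of one-hot encoding and normalization in plain Python. It is not the notebook's code: the sample rows, the helper names `one_hot` and `z_score`, and the choice of population standard deviation are all assumptions made for illustration. The result is a plain list of numeric rows, which is the kind of input most training libraries accept.

```python
from statistics import mean, pstdev

# Hypothetical rows shaped like the notebook's asteroid data
# (column names from the notebook, values invented for illustration).
rows = [
    {"object_class": "MBA", "diameter": 2.0},
    {"object_class": "APO", "diameter": 4.0},
    {"object_class": "MBA", "diameter": 6.0},
]


def one_hot(values, prefix):
    """Turn a list of category labels into one 0/1 column per category."""
    categories = sorted(set(values))
    return [{f"{prefix}_{c}": int(v == c) for c in categories} for v in values]


def z_score(values):
    """Center values on their mean and scale by the population std dev."""
    mu = mean(values)
    sigma = pstdev(values)
    # Guard against a constant column, which would otherwise divide by zero.
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


encoded = one_hot([r["object_class"] for r in rows], prefix="object_class")
normalized = z_score([r["diameter"] for r in rows])

# Combine into plain numeric feature rows that are independent of any ML framework.
features = [list(e.values()) + [n] for e, n in zip(encoded, normalized)]
```

Because `features` is just nested lists of numbers, the same preprocessed output could be handed to whichever training library is preferred, which is the point made in the reply above.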





[GitHub] [beam] PhilippeMoussalli commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
PhilippeMoussalli commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r1005716369


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,2163 @@
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "source": [
+        "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License."
+      ],
+      "metadata": {
+        "id": "sARMhsXz8yR1",
+        "cellView": "form"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like\n",
+        "API to declare and define Beam processing pipelines. They provide a familiar interface for machine learning practitioners to build complex data-processing pipelines using only standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "As we want to explore the elements within a `PCollection`, we can make use of the Interactive runner by installing Apache Beam with the `interactive` component. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",

Review Comment:
   @TheNeuralBit recently implemented another feature, one-hot encoding, which is scheduled for 2.43. I'll update the comment in that section.
   https://github.com/apache/beam/issues/23276
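
As background on why one-hot encoding is awkward on partitioned data, the sketch below shows one plausible difficulty in plain Python: if each partition infers its categories from the values it happens to see, the output columns differ between partitions, whereas a category list fixed up front gives every partition the same columns. This is an illustration only and does not describe how the linked issue was resolved; the function names and sample partitions are hypothetical.

```python
def one_hot_inferred(partition):
    """Infer the categories from the partition itself (data-dependent columns)."""
    categories = sorted(set(partition))
    return [[int(v == c) for c in categories] for v in partition]


def one_hot_fixed(partition, categories):
    """Use a category list known ahead of time (columns fixed before any data is read)."""
    return [[int(v == c) for c in categories] for v in partition]


# Two hypothetical partitions of an `object_class` column.
partition_a = ["MBA", "APO"]
partition_b = ["MBA", "MBA"]

# Row widths produced when each partition infers its own categories.
widths_inferred = {
    len(row)
    for part in (partition_a, partition_b)
    for row in one_hot_inferred(part)
}

# Row widths produced when the categories are declared up front.
known_categories = ["APO", "MBA"]
widths_fixed = {
    len(row)
    for part in (partition_a, partition_b)
    for row in one_hot_fixed(part, known_categories)
}
```

Here `widths_inferred` contains two different row widths while `widths_fixed` contains one, so only the fixed-category variant yields a schema that can be known before the data is processed.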





[GitHub] [beam] PhilippeMoussalli commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
PhilippeMoussalli commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r1005728017


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. They provide a familiar interface for machine learning practitioners to build complex data-processing pipelines using only standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to quickly test our pipeline locally before running it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we will only be working with a subset of the original dataset since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam function returns a deferred Beam DataFrame, while pandas returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, sample] where\n",
+        "# sample contains all of the dataset (around 1000000 samples)\n",
+        "file_location = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(file_location, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. A distance of 70,000 km might seem too large to pose any threat, but at an astronomical scale it is very small and can disrupt many natural phenomena. \n",
+        "\n",
+        "These objects/asteroids can therefore prove harmful, so it is wise to know what surrounds us and which of those objects could harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types."
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "c24ff83d-3a13-47a6-c9c2-3978729fde82"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 4
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a Pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 746
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "14fa80de-2dee-4963-99d8-3e321f949ff8"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "       spk_id                   full_name near_earth_object  \\\n",
+              "0     2000001                     1 Ceres                 N   \n",
+              "1     2000002                    2 Pallas                 N   \n",
+              "2     2000003                      3 Juno                 N   \n",
+              "3     2000004                     4 Vesta                 N   \n",
+              "4     2000005                   5 Astraea                 N   \n",
+              "...       ...                         ...               ...   \n",
+              "9994  2009995    9995 Alouette (4805 P-L)                 N   \n",
+              "9995  2009996         9996 ANS (9070 P-L)                 N   \n",
+              "9996  2009997        9997 COBE (1217 T-1)                 N   \n",
+              "9997  2009998         9998 ISO (1293 T-1)                 N   \n",
+              "9998  2009999       9999 Wiles (4196 T-2)                 N   \n",
+              "\n",
+              "      absolute_magnitude  diameter  albedo  diameter_sigma  eccentricity  \\\n",
+              "0                   3.40   939.400  0.0900           0.200      0.076009   \n",
+              "1                   4.20   545.000  0.1010          18.000      0.229972   \n",
+              "2                   5.33   246.596  0.2140          10.594      0.256936   \n",
+              "3                   3.00   525.400  0.4228           0.200      0.088721   \n",
+              "4                   6.90   106.699  0.2740           3.140      0.190913   \n",
+              "...                  ...       ...     ...             ...           ...   \n",
+              "9994               15.10     2.564  0.2450           0.550      0.160610   \n",
+              "9995               13.60     8.978  0.1130           0.376      0.235174   \n",
+              "9996               14.30       NaN     NaN             NaN      0.113059   \n",
+              "9997               15.10     2.235  0.3880           0.373      0.093852   \n",
+              "9998               13.00     7.148  0.2620           0.065      0.071351   \n",
+              "\n",
+              "      inclination     moid_ld object_class  semi_major_axis_au_unit  \\\n",
+              "0       10.594067  620.640533          MBA                 2.769165   \n",
+              "1       34.832932  480.348639          MBA                 2.773841   \n",
+              "2       12.991043  402.514639          MBA                 2.668285   \n",
+              "3        7.141771  443.451432          MBA                 2.361418   \n",
+              "4        5.367427  426.433027          MBA                 2.574037   \n",
+              "...           ...         ...          ...                      ...   \n",
+              "9994     2.311731  388.723233          MBA                 2.390249   \n",
+              "9995     7.657713  444.194746          MBA                 2.796605   \n",
+              "9996     2.459643  495.460110          MBA                 2.545674   \n",
+              "9997     3.912263  373.848377          MBA                 2.160961   \n",
+              "9998     3.198839  632.144398          MBA                 2.839917   \n",
+              "\n",
+              "     hazardous_flag  \n",
+              "0                 N  \n",
+              "1                 N  \n",
+              "2                 N  \n",
+              "3                 N  \n",
+              "4                 N  \n",
+              "...             ...  \n",
+              "9994              N  \n",
+              "9995              N  \n",
+              "9996              N  \n",
+              "9997              N  \n",
+              "9998              N  \n",
+              "\n",
+              "[9999 rows x 13 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-0161aa69-d50f-4d6f-84c1-10dacb278880\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>full_name</th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>2000001</td>\n",
+              "      <td>1 Ceres</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>939.400</td>\n",
+              "      <td>0.0900</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>2000002</td>\n",
+              "      <td>2 Pallas</td>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>545.000</td>\n",
+              "      <td>0.1010</td>\n",
+              "      <td>18.000</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>2000003</td>\n",
+              "      <td>3 Juno</td>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>246.596</td>\n",
+              "      <td>0.2140</td>\n",
+              "      <td>10.594</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>2000004</td>\n",
+              "      <td>4 Vesta</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>525.400</td>\n",
+              "      <td>0.4228</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>2000005</td>\n",
+              "      <td>5 Astraea</td>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>106.699</td>\n",
+              "      <td>0.2740</td>\n",
+              "      <td>3.140</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>2009995</td>\n",
+              "      <td>9995 Alouette (4805 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.564</td>\n",
+              "      <td>0.2450</td>\n",
+              "      <td>0.550</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>2009996</td>\n",
+              "      <td>9996 ANS (9070 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>8.978</td>\n",
+              "      <td>0.1130</td>\n",
+              "      <td>0.376</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>2009997</td>\n",
+              "      <td>9997 COBE (1217 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>2009998</td>\n",
+              "      <td>9998 ISO (1293 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.235</td>\n",
+              "      <td>0.3880</td>\n",
+              "      <td>0.373</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>2009999</td>\n",
+              "      <td>9999 Wiles (4196 T-2)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>7.148</td>\n",
+              "      <td>0.2620</td>\n",
+              "      <td>0.065</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 13 columns</p>\n",
+              "</div>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 5
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check whether all the columns can be used for model training. Let's first have a look at the column descriptions as provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the visual magnitude the asteroid would have if it were located 1 au from both the Sun and the observer, at zero phase angle.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km.\n",
+        "* **albedo:** the fraction of incoming solar radiation that the object reflects diffusely, measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in the object diameter, in km.\n",
+        "* **eccentricity:** a value between 0 and 1 that describes how elongated (rather than circular) the asteroid's orbit is.\n",
+        "* **inclination:** angle of the orbit with respect to the x-y ecliptic plane.\n",
+        "* **moid_ld:** Earth Minimum Orbit Intersection Distance, in lunar distances (LD).\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** half the length of the long axis of the orbit, in au.\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. These columns can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's have a look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most columns have no missing values, but columns **'diameter'**, **'albedo'** and **'diameter_sigma'** are each missing in roughly 13-14% of rows. Since these values cannot be measured or derived, we can remove these columns; they will not be used for machine learning model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "The numerical columns need to be normalized before using them to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation. This puts all the numerical features on the same scale, so that no feature dominates training merely because of its larger range."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical columns and the categorical columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 356
+        },
+        "id": "Gjc0UlDD-xUn",
+        "outputId": "cadc4402-7edc-43f6-bc6a-bc4f1fd3314a"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "error",
+          "ename": "NotImplementedError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mNotImplementedError\u001b[0m                       Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/987840581.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Normalizing method_1: Can work but relies on ticket\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frame_base.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    426\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    427\u001b[0m     raise NotImplementedError(\n\u001b[0;32m--> 428\u001b[0;31m         \u001b[0;34mf\"{op!r} is not implemented yet. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    429\u001b[0m         \u001b[0;34mf\"If support for {op!r} is important to you, please let the Beam \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    430\u001b[0m         \u001b[0;34m\"community know by writing to user@beam.apache.org \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mNotImplementedError\u001b[0m: 'loc.setitem' is not implemented yet. If support for 'loc.setitem' is important to you, please let the Beam community know by writing to user@beam.apache.org (see https://beam.apache.org/community/contact-us/) or commenting on https://github.com/apache/beam/issues/20318"
+          ]
+        }
+      ],
+      "source": [
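+        "# Normalizing method_1: Can work but relies on ticket #22267; 'loc.setitem' is not implemented yet (see https://github.com/apache/beam/issues/20318)\n",
+        "beam_df.loc[:, numerical_cols] = (beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean()) / beam_df.loc[:, numerical_cols].std()"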
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
+        "Normalizing the data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "be40308f-d27c-46fa-e365-43a635addf6b"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>306</th>\n",
+              "      <td>-1.570727</td>\n",
+              "      <td>-0.062543</td>\n",
+              "      <td>-0.278518</td>\n",
+              "      <td>0.373194</td>\n",
+              "      <td>0.357201</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>310</th>\n",
+              "      <td>-1.631718</td>\n",
+              "      <td>-1.724526</td>\n",
+              "      <td>-0.736389</td>\n",
+              "      <td>1.087833</td>\n",
+              "      <td>0.344233</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>546</th>\n",
+              "      <td>-1.753698</td>\n",
+              "      <td>1.028793</td>\n",
+              "      <td>1.415303</td>\n",
+              "      <td>-0.339489</td>\n",
+              "      <td>0.139080</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>635</th>\n",
+              "      <td>-1.875678</td>\n",
+              "      <td>0.244869</td>\n",
+              "      <td>0.005905</td>\n",
+              "      <td>0.214107</td>\n",
+              "      <td>0.367559</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>701</th>\n",
+              "      <td>-3.278451</td>\n",
+              "      <td>-1.570523</td>\n",
+              "      <td>2.006145</td>\n",
+              "      <td>1.542754</td>\n",
+              "      <td>0.829337</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9697</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-1.151809</td>\n",
+              "      <td>-0.082944</td>\n",
+              "      <td>-0.129556</td>\n",
+              "      <td>-0.533538</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9813</th>\n",
+              "      <td>1.722740</td>\n",
+              "      <td>0.844551</td>\n",
+              "      <td>-0.583247</td>\n",
+              "      <td>-1.006447</td>\n",
+              "      <td>-0.677961</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9868</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-0.207399</td>\n",
+              "      <td>-0.784665</td>\n",
+              "      <td>-0.462136</td>\n",
+              "      <td>-0.539794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9903</th>\n",
+              "      <td>0.868878</td>\n",
+              "      <td>0.460086</td>\n",
+              "      <td>0.092258</td>\n",
+              "      <td>-0.107597</td>\n",
+              "      <td>0.071794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9956</th>\n",
+              "      <td>0.746898</td>\n",
+              "      <td>-0.234132</td>\n",
+              "      <td>-0.161116</td>\n",
+              "      <td>-0.601379</td>\n",
+              "      <td>-0.664887</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 5 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_numerical_cols = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize the numerical columns only\n",
+        "beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean())/beam_numerical_cols.std()\n",
+        "\n",
+        "ib.collect(beam_numerical_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to convert the categorical columns into one-hot encoded variables so that we can use them during training.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "Ngoxg0rSywVd",
+        "outputId": "d81bb29a-f8f8-4186-b186-5ff85667dbec"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1671644751.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'object_class'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'get_dummies'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col= beam_df.filter(items=['object_class'])\n",
+        "object_class_col.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "hz8s7z8caTq-",
+        "outputId": "5c543a6b-0ea1-41f8-afec-9691bbbd1f5b"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1927971370.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;31m# df['categories_concat'].str.get_dummies('-')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'str'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col.str.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "rVdSIyCB0spw"
+      },
+      "source": [
+        "# Putting it all together\n",
+        "\n",
+        "Let's now try to summarize all the steps that we've executed above into a full pipeline implementation and visualize our pre-processed data.\n",
+        "\n",
+        "> ℹ️ Note that the only standard Beam construct invoked here is the `Pipeline` instance. The rest of the pre-processing commands are all based on native pandas methods that have been integrated with the Beam DataFrame API. "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "ndaSNond0v8Q",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 498
+        },
+        "outputId": "0155d359-45c9-4345-e1b6-b1881408f049"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "error",
+          "ename": "ValueError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1408061827.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m     25\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     26\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 27\u001b[0;31m \u001b[0mib\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcollect\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpreprocessed_dataset\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/utils.py\u001b[0m in \u001b[0;36mrun_within_progress_indicator\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    275\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mrun_within_progress_indicator\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    276\u001b[0m     \u001b[0;32mwith\u001b[0m \u001b[0mProgressIndicator\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34mf'Processing... {func.__name__}'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Done.'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 277\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001
 b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    278\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    279\u001b[0m   \u001b[0;32mreturn\u001b[0m \u001b[0mrun_within_progress_indicator\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/interactive_beam.py\u001b[0m in \u001b[0;36mcollect\u001b[0;34m(pcoll, n, duration, include_window_info)\u001b[0m\n\u001b[1;32m    945\u001b[0m         element_type=element_type)\n\u001b[1;32m    946\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 947\u001b[0;31m   \u001b[0mrecording\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mrecording_manager\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrecord\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mpcoll\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmax_n\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mn\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmax_duration\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mduration\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    948\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    949\u001b[0m   \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0
 ;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/recording_manager.py\u001b[0m in \u001b[0;36mrecord\u001b[0;34m(self, pcolls, max_n, max_duration)\u001b[0m\n\u001b[1;32m    459\u001b[0m       pf.PipelineFragment(\n\u001b[1;32m    460\u001b[0m           \u001b[0mlist\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0muncomputed_pcolls\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 461\u001b[0;31m           self.user_pipeline.options).run(blocking=is_remote_run)\n\u001b[0m\u001b[1;32m    462\u001b[0m       \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mie\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcurrent_env\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpipeline_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0muser_pipeline\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n
 \u001b[1;32m    463\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py\u001b[0m in \u001b[0;36mrun\u001b[0;34m(self, display_pipeline_graph, use_cache, blocking)\u001b[0m\n\u001b[1;32m    111\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_force_compute\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0muse_cache\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    112\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_blocking\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mblocking\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 113\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdeduce_fragment\u001
 b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    114\u001b[0m     \u001b[0;32mfinally\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    115\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_skip_display\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpreserved_skip_display\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py\u001b[0m in \u001b[0;36mdeduce_fragment\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m     98\u001b[0m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mto_runner_api\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     99\u001b[0m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 100\u001b[0;31m         self._options)\n\u001b[0m\u001b[1;32m    101\u001b[0m     \u001b[0mie\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcurrent_env\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0madd_derived_pipeline\u001b[0m\u001b[0;34m(\u001b[0m\u001b[
 0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfragment\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    102\u001b[0m     \u001b[0;32mreturn\u001b[0m \u001b[0mfragment\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/pipeline.py\u001b[0m in \u001b[0;36mfrom_runner_api\u001b[0;34m(proto, runner, options, return_context)\u001b[0m\n\u001b[1;32m    990\u001b[0m       \u001b[0mpcollection\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpipeline\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mp\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    991\u001b[0m       \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mpcollection\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mproducer\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 992\u001b[0;31m         \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'No producer for %s'\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0mid\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    993\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    994\u001b[0m 
     \u001b[0;31m# Inject PBegin input where necessary.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mValueError\u001b[0m: No producer for ref_PCollection_PCollection_265"
+          ]
+        }
+      ],
+      "source": [
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv('/content/drive/MyDrive/apache beam/dataset/nasa/sample_10000.csv', splittable=True)\n",

Review Comment:
   > Thanks! This looks great! I love the asteroids data used for this sample.
   > 
   > It would be nice to include a small paragraph saying what we're trying to do, like "Creating a preprocessed dataset for a machine learning model in Beam DataFrames". I wasn't sure what the goal of the notebook was until almost at the end. It would also be nice to mention it only creates the dataset, not the model.
   > 
   > I also noticed that the Keras preprocessing layers already take care of normalizing and one-hot-encoding data. I tend to recommend this in my own samples because it allows us to pass the model the raw data like we see it in the field, and let the model itself normalize it as part of its architecture. And adding more data to the dataset does not require normalizing the entire thing again, it could be simply adding more files. However, I think it's nice to see how to do that with Beam DataFrames as well. I think it's particularly important to explore and understand the data.
   > 
   > Other than that, I really liked the way you presented and explored the data in a very interactive way.
   
   Thanks for the feedback @davidcavazos :). Indeed, the preprocessing steps can be done with Keras. I think the benefit of showing how it can be done in Pandas is that it leaves open the possibility of training with other frameworks (sklearn, XGBoost, ...).
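
As a rough illustration of that point, the two preprocessing steps discussed in this thread (standardizing numerical columns and one-hot encoding a categorical column) can be sketched in plain pandas. The toy values below are hypothetical; only the column names `diameter`, `albedo`, and `object_class` are taken from the notebook's dataset. This is a minimal sketch of the idea, not the notebook's actual implementation.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the notebook's dataset.
df = pd.DataFrame({
    "diameter": [1.0, 2.0, 3.0],
    "albedo": [0.1, 0.2, 0.3],
    "object_class": ["MBA", "APO", "MBA"],
})

numerical_cols = ["diameter", "albedo"]

# Standardize the numerical columns: subtract the mean, divide by the
# (sample) standard deviation, column by column.
numerical = df[numerical_cols]
standardized = (numerical - numerical.mean()) / numerical.std()

# One-hot encode the categorical column with plain pandas.
one_hot = pd.get_dummies(df["object_class"], prefix="object_class")

# Combine both parts into one framework-agnostic feature table.
preprocessed = pd.concat([standardized, one_hot], axis=1)
print(preprocessed)
```

Because the result is an ordinary DataFrame, it can be handed to whichever training library is preferred, which is the flexibility mentioned above.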





[GitHub] [beam] PhilippeMoussalli commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
PhilippeMoussalli commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r948163222


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. It provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by only invoking standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to quickly test our pipeline locally before running it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we will only be working with a subset of the original dataset since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam function returns a deferred Beam DataFrame while pandas returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, sample] where\n",
+        "# sample contains all of the dataset (around 1000000 samples)\n",
+        "file_location = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(file_location, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. We might think that a distance of 70,000 km cannot potentially harm us, but at an astronomical scale this is a very small distance, and an object that close can disrupt many natural phenomena. \n",
+        "\n",
+        "These objects (asteroids) can thus prove to be harmful, so it is wise to know what surrounds us and which of these objects can harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types."
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "c24ff83d-3a13-47a6-c9c2-3978729fde82"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 4
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a Pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 746
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "14fa80de-2dee-4963-99d8-3e321f949ff8"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a986c6cc61ed5a5b622e163d92f73775\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "       spk_id                   full_name near_earth_object  \\\n",
+              "0     2000001                     1 Ceres                 N   \n",
+              "1     2000002                    2 Pallas                 N   \n",
+              "2     2000003                      3 Juno                 N   \n",
+              "3     2000004                     4 Vesta                 N   \n",
+              "4     2000005                   5 Astraea                 N   \n",
+              "...       ...                         ...               ...   \n",
+              "9994  2009995    9995 Alouette (4805 P-L)                 N   \n",
+              "9995  2009996         9996 ANS (9070 P-L)                 N   \n",
+              "9996  2009997        9997 COBE (1217 T-1)                 N   \n",
+              "9997  2009998         9998 ISO (1293 T-1)                 N   \n",
+              "9998  2009999       9999 Wiles (4196 T-2)                 N   \n",
+              "\n",
+              "      absolute_magnitude  diameter  albedo  diameter_sigma  eccentricity  \\\n",
+              "0                   3.40   939.400  0.0900           0.200      0.076009   \n",
+              "1                   4.20   545.000  0.1010          18.000      0.229972   \n",
+              "2                   5.33   246.596  0.2140          10.594      0.256936   \n",
+              "3                   3.00   525.400  0.4228           0.200      0.088721   \n",
+              "4                   6.90   106.699  0.2740           3.140      0.190913   \n",
+              "...                  ...       ...     ...             ...           ...   \n",
+              "9994               15.10     2.564  0.2450           0.550      0.160610   \n",
+              "9995               13.60     8.978  0.1130           0.376      0.235174   \n",
+              "9996               14.30       NaN     NaN             NaN      0.113059   \n",
+              "9997               15.10     2.235  0.3880           0.373      0.093852   \n",
+              "9998               13.00     7.148  0.2620           0.065      0.071351   \n",
+              "\n",
+              "      inclination     moid_ld object_class  semi_major_axis_au_unit  \\\n",
+              "0       10.594067  620.640533          MBA                 2.769165   \n",
+              "1       34.832932  480.348639          MBA                 2.773841   \n",
+              "2       12.991043  402.514639          MBA                 2.668285   \n",
+              "3        7.141771  443.451432          MBA                 2.361418   \n",
+              "4        5.367427  426.433027          MBA                 2.574037   \n",
+              "...           ...         ...          ...                      ...   \n",
+              "9994     2.311731  388.723233          MBA                 2.390249   \n",
+              "9995     7.657713  444.194746          MBA                 2.796605   \n",
+              "9996     2.459643  495.460110          MBA                 2.545674   \n",
+              "9997     3.912263  373.848377          MBA                 2.160961   \n",
+              "9998     3.198839  632.144398          MBA                 2.839917   \n",
+              "\n",
+              "     hazardous_flag  \n",
+              "0                 N  \n",
+              "1                 N  \n",
+              "2                 N  \n",
+              "3                 N  \n",
+              "4                 N  \n",
+              "...             ...  \n",
+              "9994              N  \n",
+              "9995              N  \n",
+              "9996              N  \n",
+              "9997              N  \n",
+              "9998              N  \n",
+              "\n",
+              "[9999 rows x 13 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-0161aa69-d50f-4d6f-84c1-10dacb278880\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>full_name</th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>2000001</td>\n",
+              "      <td>1 Ceres</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>939.400</td>\n",
+              "      <td>0.0900</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>2000002</td>\n",
+              "      <td>2 Pallas</td>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>545.000</td>\n",
+              "      <td>0.1010</td>\n",
+              "      <td>18.000</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>2000003</td>\n",
+              "      <td>3 Juno</td>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>246.596</td>\n",
+              "      <td>0.2140</td>\n",
+              "      <td>10.594</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>2000004</td>\n",
+              "      <td>4 Vesta</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>525.400</td>\n",
+              "      <td>0.4228</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>2000005</td>\n",
+              "      <td>5 Astraea</td>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>106.699</td>\n",
+              "      <td>0.2740</td>\n",
+              "      <td>3.140</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>2009995</td>\n",
+              "      <td>9995 Alouette (4805 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.564</td>\n",
+              "      <td>0.2450</td>\n",
+              "      <td>0.550</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>2009996</td>\n",
+              "      <td>9996 ANS (9070 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>8.978</td>\n",
+              "      <td>0.1130</td>\n",
+              "      <td>0.376</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>2009997</td>\n",
+              "      <td>9997 COBE (1217 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>2009998</td>\n",
+              "      <td>9998 ISO (1293 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.235</td>\n",
+              "      <td>0.3880</td>\n",
+              "      <td>0.373</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>2009999</td>\n",
+              "      <td>9999 Wiles (4196 T-2)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>7.148</td>\n",
+              "      <td>0.2620</td>\n",
+              "      <td>0.065</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 13 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-0161aa69-d50f-4d6f-84c1-10dacb278880')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 5
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check whether all the columns can be used for model training. Let's first have a look at the column descriptions provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the visual magnitude an asteroid would have if it were located 1 au from both the Sun and the observer, at zero phase angle.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km.\n",
+        "* **albedo:** the fraction of incoming solar radiation that is diffusely reflected, measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in object diameter, in km.\n",
+        "* **eccentricity:** value between 0 and 1 that describes how elongated the asteroid's orbit is (0 is a perfect circle).\n",
+        "* **inclination:** angle with respect to x-y ecliptic plane\n",
+        "* **moid_ld:** Earth Minimum Orbit Intersection Distance, in lunar distances (LD).\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** the length of half of the long axis of the orbit, in au.\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. These columns can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's have a look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_02cba1067de8e024374a26297834d233\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most columns have no missing values. Columns **'diameter'**, **'albedo'** and **'diameter_sigma'** have many missing values (roughly 13-14% each). Because these values cannot be measured or derived, we can remove these columns; they will not be required for machine learning model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "The numerical columns need to be normalized before using them to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation (the z-score). This puts all numerical columns on the same scale so that they are weighted equally during training."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical columns and the categorical columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 356
+        },
+        "id": "Gjc0UlDD-xUn",
+        "outputId": "cadc4402-7edc-43f6-bc6a-bc4f1fd3314a"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "error",
+          "ename": "NotImplementedError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mNotImplementedError\u001b[0m                       Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/987840581.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Normalizing method_1: Can work but relies on ticket\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frame_base.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    426\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    427\u001b[0m     raise NotImplementedError(\n\u001b[0;32m--> 428\u001b[0;31m         \u001b[0;34mf\"{op!r} is not implemented yet. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    429\u001b[0m         \u001b[0;34mf\"If support for {op!r} is important to you, please let the Beam \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    430\u001b[0m         \u001b[0;34m\"community know by writing to user@beam.apache.org \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mNotImplementedError\u001b[0m: 'loc.setitem' is not implemented yet. If support for 'loc.setitem' is important to you, please let the Beam community know by writing to user@beam.apache.org (see https://beam.apache.org/community/contact-us/) or commenting on https://github.com/apache/beam/issues/20318"
+          ]
+        }
+      ],
+      "source": [
+        "# Normalizing method_1: Can work but relies on ticket #22267\n",
+        "beam_df.loc[:, numerical_cols] = (beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean()) / beam_df.loc[:, numerical_cols].std()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
+        "Normalizing the data by standardizing a separate dataframe that holds only the numerical columns"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "be40308f-d27c-46fa-e365-43a635addf6b"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>306</th>\n",
+              "      <td>-1.570727</td>\n",
+              "      <td>-0.062543</td>\n",
+              "      <td>-0.278518</td>\n",
+              "      <td>0.373194</td>\n",
+              "      <td>0.357201</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>310</th>\n",
+              "      <td>-1.631718</td>\n",
+              "      <td>-1.724526</td>\n",
+              "      <td>-0.736389</td>\n",
+              "      <td>1.087833</td>\n",
+              "      <td>0.344233</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>546</th>\n",
+              "      <td>-1.753698</td>\n",
+              "      <td>1.028793</td>\n",
+              "      <td>1.415303</td>\n",
+              "      <td>-0.339489</td>\n",
+              "      <td>0.139080</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>635</th>\n",
+              "      <td>-1.875678</td>\n",
+              "      <td>0.244869</td>\n",
+              "      <td>0.005905</td>\n",
+              "      <td>0.214107</td>\n",
+              "      <td>0.367559</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>701</th>\n",
+              "      <td>-3.278451</td>\n",
+              "      <td>-1.570523</td>\n",
+              "      <td>2.006145</td>\n",
+              "      <td>1.542754</td>\n",
+              "      <td>0.829337</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9697</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-1.151809</td>\n",
+              "      <td>-0.082944</td>\n",
+              "      <td>-0.129556</td>\n",
+              "      <td>-0.533538</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9813</th>\n",
+              "      <td>1.722740</td>\n",
+              "      <td>0.844551</td>\n",
+              "      <td>-0.583247</td>\n",
+              "      <td>-1.006447</td>\n",
+              "      <td>-0.677961</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9868</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-0.207399</td>\n",
+              "      <td>-0.784665</td>\n",
+              "      <td>-0.462136</td>\n",
+              "      <td>-0.539794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9903</th>\n",
+              "      <td>0.868878</td>\n",
+              "      <td>0.460086</td>\n",
+              "      <td>0.092258</td>\n",
+              "      <td>-0.107597</td>\n",
+              "      <td>0.071794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9956</th>\n",
+              "      <td>0.746898</td>\n",
+              "      <td>-0.234132</td>\n",
+              "      <td>-0.161116</td>\n",
+              "      <td>-0.601379</td>\n",
+              "      <td>-0.664887</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 5 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_numerical_cols = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize the dataframe that contains only the numerical columns\n",
+        "beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean())/beam_numerical_cols.std()\n",
+        "\n",
+        "ib.collect(beam_numerical_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to convert the categorical columns into one-hot encoded variables to use them during training.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "Ngoxg0rSywVd",
+        "outputId": "d81bb29a-f8f8-4186-b186-5ff85667dbec"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1671644751.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'object_class'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'get_dummies'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col= beam_df.filter(items=['object_class'])\n",
+        "object_class_col.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "hz8s7z8caTq-",
+        "outputId": "5c543a6b-0ea1-41f8-afec-9691bbbd1f5b"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1927971370.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;31m# df['categories_concat'].str.get_dummies('-')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'str'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col.str.get_dummies()"

Review Comment:
   I tried this approach and I am still not able to run the `unique()` command, even when passing the argument as `unique(as_series=True)` as specified [here](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.frames.html#:~:text=b%0A2%20%20%20%20c-,unique,-(as_series%3DFalse))
   
   **Error**: `unique()` is not implemented for deferred DataFrames.
   In any case, I think it would be nicer to wait for `get_dummies()`, as it will be more intuitive for users. I see that you already filed a [ticket](https://github.com/apache/beam/issues/22646) for it and referred to it in the deliverable 3 doc. 
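   
   For reference, here is a minimal pandas-only sketch of the encoding when the categories are declared up front, so that no `unique()` pass is needed to discover the output columns. The sample rows and the `known_classes` list are assumptions made for illustration; only the `object_class` column name comes from the notebook. Whether the same calls work on a deferred Beam DataFrame depends on the Beam version and is not confirmed here.

```python
import pandas as pd

# Hypothetical sample data: 'object_class' mirrors the notebook's column name,
# but these rows and category values are illustrative assumptions.
df = pd.DataFrame({'object_class': ['MBA', 'APO', 'MBA', 'AMO']})
known_classes = ['AMO', 'APO', 'MBA']

# Declaring the categories explicitly fixes the set of output columns,
# so the data does not have to be scanned to find the distinct values.
object_class = df['object_class'].astype(
    pd.CategoricalDtype(categories=known_classes))

# One indicator column per declared category, in the declared order.
one_hot = pd.get_dummies(object_class, prefix='object_class')
print(one_hot)
```

   The design choice is to trade a data-dependent column set for a fixed one, which is the property a deferred DataFrame would need in order to know its schema before execution.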
   





[GitHub] [beam] PhilippeMoussalli commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
PhilippeMoussalli commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r948156320


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. The API provides a familiar interface for machine learning practitioners to build complex data-processing pipelines using only standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to quickly test our pipeline locally before running it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we will only be working with a subset of the original dataset since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam function returns a deferred Beam DataFrame while the pandas function returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, sample] where\n",
+        "# sample contains all of the dataset (around 1000000 samples)\n",
+        "file_location = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(file_location, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. Although a distance of 70,000 km might seem too large to pose any harm, it is very small on an astronomical scale and can disrupt many natural phenomena.\n",
+        "\n",
+        "These objects/asteroids can therefore prove harmful, so it is wise to know what surrounds us and which of these objects could harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest-Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types."
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "c24ff83d-3a13-47a6-c9c2-3978729fde82"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 4
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a Pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 746
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "14fa80de-2dee-4963-99d8-3e321f949ff8"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a986c6cc61ed5a5b622e163d92f73775\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "       spk_id                   full_name near_earth_object  \\\n",
+              "0     2000001                     1 Ceres                 N   \n",
+              "1     2000002                    2 Pallas                 N   \n",
+              "2     2000003                      3 Juno                 N   \n",
+              "3     2000004                     4 Vesta                 N   \n",
+              "4     2000005                   5 Astraea                 N   \n",
+              "...       ...                         ...               ...   \n",
+              "9994  2009995    9995 Alouette (4805 P-L)                 N   \n",
+              "9995  2009996         9996 ANS (9070 P-L)                 N   \n",
+              "9996  2009997        9997 COBE (1217 T-1)                 N   \n",
+              "9997  2009998         9998 ISO (1293 T-1)                 N   \n",
+              "9998  2009999       9999 Wiles (4196 T-2)                 N   \n",
+              "\n",
+              "      absolute_magnitude  diameter  albedo  diameter_sigma  eccentricity  \\\n",
+              "0                   3.40   939.400  0.0900           0.200      0.076009   \n",
+              "1                   4.20   545.000  0.1010          18.000      0.229972   \n",
+              "2                   5.33   246.596  0.2140          10.594      0.256936   \n",
+              "3                   3.00   525.400  0.4228           0.200      0.088721   \n",
+              "4                   6.90   106.699  0.2740           3.140      0.190913   \n",
+              "...                  ...       ...     ...             ...           ...   \n",
+              "9994               15.10     2.564  0.2450           0.550      0.160610   \n",
+              "9995               13.60     8.978  0.1130           0.376      0.235174   \n",
+              "9996               14.30       NaN     NaN             NaN      0.113059   \n",
+              "9997               15.10     2.235  0.3880           0.373      0.093852   \n",
+              "9998               13.00     7.148  0.2620           0.065      0.071351   \n",
+              "\n",
+              "      inclination     moid_ld object_class  semi_major_axis_au_unit  \\\n",
+              "0       10.594067  620.640533          MBA                 2.769165   \n",
+              "1       34.832932  480.348639          MBA                 2.773841   \n",
+              "2       12.991043  402.514639          MBA                 2.668285   \n",
+              "3        7.141771  443.451432          MBA                 2.361418   \n",
+              "4        5.367427  426.433027          MBA                 2.574037   \n",
+              "...           ...         ...          ...                      ...   \n",
+              "9994     2.311731  388.723233          MBA                 2.390249   \n",
+              "9995     7.657713  444.194746          MBA                 2.796605   \n",
+              "9996     2.459643  495.460110          MBA                 2.545674   \n",
+              "9997     3.912263  373.848377          MBA                 2.160961   \n",
+              "9998     3.198839  632.144398          MBA                 2.839917   \n",
+              "\n",
+              "     hazardous_flag  \n",
+              "0                 N  \n",
+              "1                 N  \n",
+              "2                 N  \n",
+              "3                 N  \n",
+              "4                 N  \n",
+              "...             ...  \n",
+              "9994              N  \n",
+              "9995              N  \n",
+              "9996              N  \n",
+              "9997              N  \n",
+              "9998              N  \n",
+              "\n",
+              "[9999 rows x 13 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-0161aa69-d50f-4d6f-84c1-10dacb278880\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>full_name</th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>2000001</td>\n",
+              "      <td>1 Ceres</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>939.400</td>\n",
+              "      <td>0.0900</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>2000002</td>\n",
+              "      <td>2 Pallas</td>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>545.000</td>\n",
+              "      <td>0.1010</td>\n",
+              "      <td>18.000</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>2000003</td>\n",
+              "      <td>3 Juno</td>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>246.596</td>\n",
+              "      <td>0.2140</td>\n",
+              "      <td>10.594</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>2000004</td>\n",
+              "      <td>4 Vesta</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>525.400</td>\n",
+              "      <td>0.4228</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>2000005</td>\n",
+              "      <td>5 Astraea</td>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>106.699</td>\n",
+              "      <td>0.2740</td>\n",
+              "      <td>3.140</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>2009995</td>\n",
+              "      <td>9995 Alouette (4805 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.564</td>\n",
+              "      <td>0.2450</td>\n",
+              "      <td>0.550</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>2009996</td>\n",
+              "      <td>9996 ANS (9070 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>8.978</td>\n",
+              "      <td>0.1130</td>\n",
+              "      <td>0.376</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>2009997</td>\n",
+              "      <td>9997 COBE (1217 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>2009998</td>\n",
+              "      <td>9998 ISO (1293 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.235</td>\n",
+              "      <td>0.3880</td>\n",
+              "      <td>0.373</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>2009999</td>\n",
+              "      <td>9999 Wiles (4196 T-2)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>7.148</td>\n",
+              "      <td>0.2620</td>\n",
+              "      <td>0.065</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 13 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-0161aa69-d50f-4d6f-84c1-10dacb278880')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 5
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check whether all the columns can be used for model training. Let's first have a look at the column descriptions as provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the magnitude the asteroid would have if it were located 1 au from both the Sun and the observer, at zero phase angle.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km.\n",
+        "* **albedo:** the fraction of incoming solar radiation that the object reflects diffusely, measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in object diameter, in km.\n",
+        "* **eccentricity:** value between 0 and 1 that describes how far the asteroid's orbit deviates from a circle (0 is a perfect circle).\n",
+        "* **inclination:** angle of the orbit with respect to the x-y ecliptic plane.\n",
+        "* **moid_ld:** Earth minimum orbit intersection distance (MOID), in lunar distances (LD).\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** the length of half of the long axis of the orbit, in au.\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. These identifier columns can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's have a look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_02cba1067de8e024374a26297834d233\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most columns have no missing values. Columns **'diameter'**, **'albedo'** and **'diameter_sigma'** are each missing in roughly 13-14% of the rows. Since these values cannot be measured or derived for the affected rows, we remove these columns; they will not be required for machine learning model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "The numerical columns need to be normalized before using them to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation. This ensures that all the columns have the same scale and are weighted equally during training."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical columns and the categorical columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 356
+        },
+        "id": "Gjc0UlDD-xUn",
+        "outputId": "cadc4402-7edc-43f6-bc6a-bc4f1fd3314a"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "error",
+          "ename": "NotImplementedError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mNotImplementedError\u001b[0m                       Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/987840581.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Normalizing method_1: Can work but relies on ticket\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frame_base.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    426\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    427\u001b[0m     raise NotImplementedError(\n\u001b[0;32m--> 428\u001b[0;31m         \u001b[0;34mf\"{op!r} is not implemented yet. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    429\u001b[0m         \u001b[0;34mf\"If support for {op!r} is important to you, please let the Beam \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    430\u001b[0m         \u001b[0;34m\"community know by writing to user@beam.apache.org \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mNotImplementedError\u001b[0m: 'loc.setitem' is not implemented yet. If support for 'loc.setitem' is important to you, please let the Beam community know by writing to user@beam.apache.org (see https://beam.apache.org/community/contact-us/) or commenting on https://github.com/apache/beam/issues/20318"
+          ]
+        }
+      ],
+      "source": [
+        "# Normalizing, method 1: blocked for now because 'loc.setitem' is not implemented yet (see https://github.com/apache/beam/issues/20318)\n",
+        "beam_df.loc[:, numerical_cols] = (beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean()) / beam_df.loc[:, numerical_cols].std()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
+        "Normalizing the data: subtract each numerical column's mean and divide by its standard deviation."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "be40308f-d27c-46fa-e365-43a635addf6b"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>306</th>\n",
+              "      <td>-1.570727</td>\n",
+              "      <td>-0.062543</td>\n",
+              "      <td>-0.278518</td>\n",
+              "      <td>0.373194</td>\n",
+              "      <td>0.357201</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>310</th>\n",
+              "      <td>-1.631718</td>\n",
+              "      <td>-1.724526</td>\n",
+              "      <td>-0.736389</td>\n",
+              "      <td>1.087833</td>\n",
+              "      <td>0.344233</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>546</th>\n",
+              "      <td>-1.753698</td>\n",
+              "      <td>1.028793</td>\n",
+              "      <td>1.415303</td>\n",
+              "      <td>-0.339489</td>\n",
+              "      <td>0.139080</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>635</th>\n",
+              "      <td>-1.875678</td>\n",
+              "      <td>0.244869</td>\n",
+              "      <td>0.005905</td>\n",
+              "      <td>0.214107</td>\n",
+              "      <td>0.367559</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>701</th>\n",
+              "      <td>-3.278451</td>\n",
+              "      <td>-1.570523</td>\n",
+              "      <td>2.006145</td>\n",
+              "      <td>1.542754</td>\n",
+              "      <td>0.829337</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9697</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-1.151809</td>\n",
+              "      <td>-0.082944</td>\n",
+              "      <td>-0.129556</td>\n",
+              "      <td>-0.533538</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9813</th>\n",
+              "      <td>1.722740</td>\n",
+              "      <td>0.844551</td>\n",
+              "      <td>-0.583247</td>\n",
+              "      <td>-1.006447</td>\n",
+              "      <td>-0.677961</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9868</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-0.207399</td>\n",
+              "      <td>-0.784665</td>\n",
+              "      <td>-0.462136</td>\n",
+              "      <td>-0.539794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9903</th>\n",
+              "      <td>0.868878</td>\n",
+              "      <td>0.460086</td>\n",
+              "      <td>0.092258</td>\n",
+              "      <td>-0.107597</td>\n",
+              "      <td>0.071794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9956</th>\n",
+              "      <td>0.746898</td>\n",
+              "      <td>-0.234132</td>\n",
+              "      <td>-0.161116</td>\n",
+              "      <td>-0.601379</td>\n",
+              "      <td>-0.664887</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 5 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_numerical_cols = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize only the numerical columns\n",
+        "beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean())/beam_numerical_cols.std()\n",
+        "\n",
+        "ib.collect(beam_numerical_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to convert the categorical columns into one-hot encoded variables so that they can be used during training.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "Ngoxg0rSywVd",
+        "outputId": "d81bb29a-f8f8-4186-b186-5ff85667dbec"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1671644751.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'object_class'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'get_dummies'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col= beam_df.filter(items=['object_class'])\n",
+        "object_class_col.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "hz8s7z8caTq-",
+        "outputId": "5c543a6b-0ea1-41f8-afec-9691bbbd1f5b"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1927971370.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;31m# df['categories_concat'].str.get_dummies('-')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'str'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col.str.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "rVdSIyCB0spw"
+      },
+      "source": [
+        "# Putting it all together\n",
+        "\n",
+        "Let's now combine all the steps executed above into a full pipeline implementation and visualize our pre-processed data.\n",
+        "\n",
+        "> ℹ️ Note that the only standard Beam construct used here is the `Pipeline` instance. The rest of the pre-processing commands are native pandas methods that are supported by the Beam DataFrame API."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "ndaSNond0v8Q",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 498
+        },
+        "outputId": "0155d359-45c9-4345-e1b6-b1881408f049"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "error",
+          "ename": "ValueError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1408061827.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m     25\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     26\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 27\u001b[0;31m \u001b[0mib\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcollect\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpreprocessed_dataset\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/utils.py\u001b[0m in \u001b[0;36mrun_within_progress_indicator\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    275\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mrun_within_progress_indicator\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    276\u001b[0m     \u001b[0;32mwith\u001b[0m \u001b[0mProgressIndicator\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34mf'Processing... {func.__name__}'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Done.'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 277\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001
 b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    278\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    279\u001b[0m   \u001b[0;32mreturn\u001b[0m \u001b[0mrun_within_progress_indicator\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/interactive_beam.py\u001b[0m in \u001b[0;36mcollect\u001b[0;34m(pcoll, n, duration, include_window_info)\u001b[0m\n\u001b[1;32m    945\u001b[0m         element_type=element_type)\n\u001b[1;32m    946\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 947\u001b[0;31m   \u001b[0mrecording\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mrecording_manager\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrecord\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mpcoll\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmax_n\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mn\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmax_duration\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mduration\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    948\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    949\u001b[0m   \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0
 ;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/recording_manager.py\u001b[0m in \u001b[0;36mrecord\u001b[0;34m(self, pcolls, max_n, max_duration)\u001b[0m\n\u001b[1;32m    459\u001b[0m       pf.PipelineFragment(\n\u001b[1;32m    460\u001b[0m           \u001b[0mlist\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0muncomputed_pcolls\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 461\u001b[0;31m           self.user_pipeline.options).run(blocking=is_remote_run)\n\u001b[0m\u001b[1;32m    462\u001b[0m       \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mie\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcurrent_env\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpipeline_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0muser_pipeline\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n
 \u001b[1;32m    463\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py\u001b[0m in \u001b[0;36mrun\u001b[0;34m(self, display_pipeline_graph, use_cache, blocking)\u001b[0m\n\u001b[1;32m    111\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_force_compute\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0muse_cache\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    112\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_blocking\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mblocking\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 113\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdeduce_fragment\u001
 b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    114\u001b[0m     \u001b[0;32mfinally\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    115\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_skip_display\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpreserved_skip_display\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py\u001b[0m in \u001b[0;36mdeduce_fragment\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m     98\u001b[0m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mto_runner_api\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     99\u001b[0m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 100\u001b[0;31m         self._options)\n\u001b[0m\u001b[1;32m    101\u001b[0m     \u001b[0mie\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcurrent_env\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0madd_derived_pipeline\u001b[0m\u001b[0;34m(\u001b[0m\u001b[
 0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfragment\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    102\u001b[0m     \u001b[0;32mreturn\u001b[0m \u001b[0mfragment\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/pipeline.py\u001b[0m in \u001b[0;36mfrom_runner_api\u001b[0;34m(proto, runner, options, return_context)\u001b[0m\n\u001b[1;32m    990\u001b[0m       \u001b[0mpcollection\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpipeline\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mp\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    991\u001b[0m       \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mpcollection\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mproducer\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 992\u001b[0;31m         \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'No producer for %s'\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0mid\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    993\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    994\u001b[0m 
     \u001b[0;31m# Inject PBegin input where necessary.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mValueError\u001b[0m: No producer for ref_PCollection_PCollection_265"
+          ]
+        }
+      ],
+      "source": [
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv('/content/drive/MyDrive/apache beam/dataset/nasa/sample_10000.csv', splittable=True)\n",
+        "\n",
+        "# Drop irrelevant columns/columns with missing values\n",
+        "beam_df = beam_df.drop(['spk_id', 'full_name','diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "\n",
+        "# Get numerical columns/columns with categorical variables\n",
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))\n",
+        "\n",
+        "# Normalize the numerical variables \n",
+        "beam_df_numerical = beam_df.filter(items=numerical_cols)\n",
+        "beam_df_numerical = (beam_df_numerical - beam_df_numerical.mean())/beam_df_numerical.std()\n",
+        "\n",
+        "\n",
+        "# One-hot encode the categorical variables \n",
+        "beam_df_categorical = beam_df.filter(items=categorical_cols)\n",
+        "# TODO: one-hot encoding step\n",
+        "\n",
+        "# Merge the normalized variables with the one-hot encoded variables\n",
+        "preprocessed_dataset = beam_df_categorical.merge(beam_df_numerical, left_index = True, right_index = True)\n",
+        "\n",
+        "ib.collect(preprocessed_dataset)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xZvJTqa3XKI_"
+      },
+      "source": [
+        "# Part II : Process the full dataset with the Distributed Runner\n",
+        "Now that we've showcased how to build and execute the pipeline locally using the Interactive Runner, it's time to execute our pipeline on our full dataset by switching to a distributed runner. For this example, we will execute our pipeline on [Dataflow](https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline)."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "PROJECT_ID = \"<my-gcp-project>\"\n",
+        "REGION = \"us-west1\"\n",
+        "TEMP_DIR = \"gs://<my-bucket>/tmp\"\n",
+        "OUTPUT_DIR = \"gs://<my-bucket>/dataframe-result\""
+      ],
+      "metadata": {
+        "id": "dDBYbMEWbL4t"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "> ℹ️ Note that we are now processing the full dataset `sample.csv` that contains approximately 1 million rows. We're also writing the results to a `csv` file instead of using `ib.collect()` to materialize the deferred dataframe.\n",

Review Comment:
   good remark ;), indeed I think `full.csv` makes more sense. I'll change it accordingly.





[GitHub] [beam] PhilippeMoussalli commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
PhilippeMoussalli commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r1013038675


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,3496 @@
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "source": [
+        "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License."
+      ],
+      "metadata": {
+        "id": "sARMhsXz8yR1",
+        "cellView": "form"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like\n",
+        "API to declare and define Beam processing pipelines. This API provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by invoking only standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Goal\n",
+        "The goal of this notebook is to explore a dataset and preprocess it for machine learning model training using the Beam DataFrames API.\n",
+        "\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "As we want to explore the elements within a `PCollection`, we can make use of the Interactive runner by installing Apache Beam with the `interactive` component. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.43</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "Install latest version"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C",
+        "beam:comment": "TODO(https://github.com/apache/beam/XXXX): Just install 2.43.0 once it's released, [`issue 23276`](https://github.com/apache/beam/issues/23276) is currently not implemented for Beam 2.42 (required fix for implementing `str.get_dummies()`)"

Review Comment:
   Done





[GitHub] [beam] davidcavazos commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
davidcavazos commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r1012101465


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,3517 @@
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "source": [
+        "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License."
+      ],
+      "metadata": {
+        "id": "sARMhsXz8yR1",
+        "cellView": "form"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like\n",
+        "API to declare and define Beam processing pipelines. This API provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by invoking only standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Goal\n",
+        "The goal of this notebook is to explore a dataset and preprocess it for machine learning model training using the Beam DataFrames API.\n",
+        "\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "As we want to explore the elements within a `PCollection`, we can make use of the Interactive runner by installing Apache Beam with the `interactive` component. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[26/10/2022]:** [`issue 23276`](https://github.com/apache/beam/issues/23276) is currently not implemented for Beam 2.42 (required fix for implementing `str.get_dummies()`)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to test our code interactively, building out the pipeline as we go before deploying it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we will only be working with a subset of the original dataset since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam function returns a deferred Beam DataFrame, while the pandas function returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 26,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam import dataframe\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, full] where\n",
+        "# full contains the entire dataset (around 1000000 samples)\n",
+        "\n",
+        "source_csv_file = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "beam_df = p | beam.dataframe.io.read_csv(source_csv_file)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. Although a distance of 70,000 km might not seem potentially harmful, at an astronomical scale it is a very small distance and can disrupt many natural phenomena. \n",
+        "\n",
+        "These objects/asteroids can therefore prove to be harmful, so it is wise to know what surrounds us and which of those objects can harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types."
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 27,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "3dfba30d-165e-46a6-b0b9-f12519db1c27"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 27
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 28,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 746
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "3f89126d-f6fb-43fc-d87b-5daf8563e057"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_79206f341d7de09f6cacdd05be309575\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_79206f341d7de09f6cacdd05be309575\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_79206f341d7de09f6cacdd05be309575\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "       spk_id                   full_name near_earth_object  \\\n",
+              "0     2000001                     1 Ceres                 N   \n",
+              "1     2000002                    2 Pallas                 N   \n",
+              "2     2000003                      3 Juno                 N   \n",
+              "3     2000004                     4 Vesta                 N   \n",
+              "4     2000005                   5 Astraea                 N   \n",
+              "...       ...                         ...               ...   \n",
+              "9994  2009995    9995 Alouette (4805 P-L)                 N   \n",
+              "9995  2009996         9996 ANS (9070 P-L)                 N   \n",
+              "9996  2009997        9997 COBE (1217 T-1)                 N   \n",
+              "9997  2009998         9998 ISO (1293 T-1)                 N   \n",
+              "9998  2009999       9999 Wiles (4196 T-2)                 N   \n",
+              "\n",
+              "      absolute_magnitude  diameter  albedo  diameter_sigma  eccentricity  \\\n",
+              "0                   3.40   939.400  0.0900           0.200      0.076009   \n",
+              "1                   4.20   545.000  0.1010          18.000      0.229972   \n",
+              "2                   5.33   246.596  0.2140          10.594      0.256936   \n",
+              "3                   3.00   525.400  0.4228           0.200      0.088721   \n",
+              "4                   6.90   106.699  0.2740           3.140      0.190913   \n",
+              "...                  ...       ...     ...             ...           ...   \n",
+              "9994               15.10     2.564  0.2450           0.550      0.160610   \n",
+              "9995               13.60     8.978  0.1130           0.376      0.235174   \n",
+              "9996               14.30       NaN     NaN             NaN      0.113059   \n",
+              "9997               15.10     2.235  0.3880           0.373      0.093852   \n",
+              "9998               13.00     7.148  0.2620           0.065      0.071351   \n",
+              "\n",
+              "      inclination     moid_ld object_class  semi_major_axis_au_unit  \\\n",
+              "0       10.594067  620.640533          MBA                 2.769165   \n",
+              "1       34.832932  480.348639          MBA                 2.773841   \n",
+              "2       12.991043  402.514639          MBA                 2.668285   \n",
+              "3        7.141771  443.451432          MBA                 2.361418   \n",
+              "4        5.367427  426.433027          MBA                 2.574037   \n",
+              "...           ...         ...          ...                      ...   \n",
+              "9994     2.311731  388.723233          MBA                 2.390249   \n",
+              "9995     7.657713  444.194746          MBA                 2.796605   \n",
+              "9996     2.459643  495.460110          MBA                 2.545674   \n",
+              "9997     3.912263  373.848377          MBA                 2.160961   \n",
+              "9998     3.198839  632.144398          MBA                 2.839917   \n",
+              "\n",
+              "     hazardous_flag  \n",
+              "0                 N  \n",
+              "1                 N  \n",
+              "2                 N  \n",
+              "3                 N  \n",
+              "4                 N  \n",
+              "...             ...  \n",
+              "9994              N  \n",
+              "9995              N  \n",
+              "9996              N  \n",
+              "9997              N  \n",
+              "9998              N  \n",
+              "\n",
+              "[9999 rows x 13 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-064477d9-d4b6-44a6-a8fd-31f0cad2dcb5\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>full_name</th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>2000001</td>\n",
+              "      <td>1 Ceres</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>939.400</td>\n",
+              "      <td>0.0900</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>2000002</td>\n",
+              "      <td>2 Pallas</td>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>545.000</td>\n",
+              "      <td>0.1010</td>\n",
+              "      <td>18.000</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>2000003</td>\n",
+              "      <td>3 Juno</td>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>246.596</td>\n",
+              "      <td>0.2140</td>\n",
+              "      <td>10.594</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>2000004</td>\n",
+              "      <td>4 Vesta</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>525.400</td>\n",
+              "      <td>0.4228</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>2000005</td>\n",
+              "      <td>5 Astraea</td>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>106.699</td>\n",
+              "      <td>0.2740</td>\n",
+              "      <td>3.140</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>2009995</td>\n",
+              "      <td>9995 Alouette (4805 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.564</td>\n",
+              "      <td>0.2450</td>\n",
+              "      <td>0.550</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>2009996</td>\n",
+              "      <td>9996 ANS (9070 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>8.978</td>\n",
+              "      <td>0.1130</td>\n",
+              "      <td>0.376</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>2009997</td>\n",
+              "      <td>9997 COBE (1217 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>2009998</td>\n",
+              "      <td>9998 ISO (1293 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.235</td>\n",
+              "      <td>0.3880</td>\n",
+              "      <td>0.373</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>2009999</td>\n",
+              "      <td>9999 Wiles (4196 T-2)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>7.148</td>\n",
+              "      <td>0.2620</td>\n",
+              "      <td>0.065</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 13 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-064477d9-d4b6-44a6-a8fd-31f0cad2dcb5')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-064477d9-d4b6-44a6-a8fd-31f0cad2dcb5 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-064477d9-d4b6-44a6-a8fd-31f0cad2dcb5');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 28
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://developers.google.com/machine-learning/data-prep/transform/normalization) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://developers.google.com/machine-learning/data-prep/transform/transform-categorical) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can also use the standard pandas command `DataFrame.describe()` to generate descriptive statistics for the numerical columns, such as percentiles, mean, and standard deviation. "
+      ],
+      "metadata": {
+        "id": "MGAErO0lAYws"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "with dataframe.allow_non_parallel_operations():\n",
+        "  beam_df_description = ib.collect(beam_df.describe())\n",
+        "\n",
+        "beam_df_description"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 378
+        },
+        "id": "Befv697VBGM7",
+        "outputId": "bb465020-94e4-4b3c-fda6-6e43da199be1"
+      },
+      "execution_count": 21,
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_98687cb0060a8077a8abab6e464e4a75\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_98687cb0060a8077a8abab6e464e4a75\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_98687cb0060a8077a8abab6e464e4a75\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "             spk_id  absolute_magnitude     diameter       albedo  \\\n",
+              "count  9.999000e+03         9999.000000  8688.000000  8672.000000   \n",
+              "mean   2.005000e+06           12.675380    19.245446     0.197723   \n",
+              "std    2.886607e+03            1.639609    30.190191     0.138819   \n",
+              "min    2.000001e+06            3.000000     0.300000     0.008000   \n",
+              "25%    2.002500e+06           11.900000     5.614000     0.074000   \n",
+              "50%    2.005000e+06           12.900000     9.814000     0.187000   \n",
+              "75%    2.007500e+06           13.700000    19.156750     0.283000   \n",
+              "max    2.009999e+06           20.700000   939.400000     1.000000   \n",
+              "\n",
+              "       diameter_sigma  eccentricity  inclination      moid_ld  \\\n",
+              "count     8591.000000   9999.000000  9999.000000  9999.000000   \n",
+              "mean         0.454072      0.148716     7.890742   509.805237   \n",
+              "std          1.093676      0.083803     6.336244   205.046582   \n",
+              "min          0.006000      0.001003     0.042716     0.131028   \n",
+              "25%          0.120000      0.093780     3.220137   377.829197   \n",
+              "50%          0.201000      0.140335     6.018836   470.650523   \n",
+              "75%          0.375000      0.187092    10.918176   636.010802   \n",
+              "max         39.297000      0.889831    68.018875  4241.524913   \n",
+              "\n",
+              "       semi_major_axis_au_unit  \n",
+              "count              9999.000000  \n",
+              "mean                  2.689836  \n",
+              "std                   0.607190  \n",
+              "min                   0.832048  \n",
+              "25%                   2.340816  \n",
+              "50%                   2.614468  \n",
+              "75%                   3.005449  \n",
+              "max                  24.667968  "
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-238f6456-dbfb-4707-a725-c607c847f522\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>count</th>\n",
+              "      <td>9.999000e+03</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>8688.000000</td>\n",
+              "      <td>8672.000000</td>\n",
+              "      <td>8591.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>mean</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.675380</td>\n",
+              "      <td>19.245446</td>\n",
+              "      <td>0.197723</td>\n",
+              "      <td>0.454072</td>\n",
+              "      <td>0.148716</td>\n",
+              "      <td>7.890742</td>\n",
+              "      <td>509.805237</td>\n",
+              "      <td>2.689836</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>std</th>\n",
+              "      <td>2.886607e+03</td>\n",
+              "      <td>1.639609</td>\n",
+              "      <td>30.190191</td>\n",
+              "      <td>0.138819</td>\n",
+              "      <td>1.093676</td>\n",
+              "      <td>0.083803</td>\n",
+              "      <td>6.336244</td>\n",
+              "      <td>205.046582</td>\n",
+              "      <td>0.607190</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>min</th>\n",
+              "      <td>2.000001e+06</td>\n",
+              "      <td>3.000000</td>\n",
+              "      <td>0.300000</td>\n",
+              "      <td>0.008000</td>\n",
+              "      <td>0.006000</td>\n",
+              "      <td>0.001003</td>\n",
+              "      <td>0.042716</td>\n",
+              "      <td>0.131028</td>\n",
+              "      <td>0.832048</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>25%</th>\n",
+              "      <td>2.002500e+06</td>\n",
+              "      <td>11.900000</td>\n",
+              "      <td>5.614000</td>\n",
+              "      <td>0.074000</td>\n",
+              "      <td>0.120000</td>\n",
+              "      <td>0.093780</td>\n",
+              "      <td>3.220137</td>\n",
+              "      <td>377.829197</td>\n",
+              "      <td>2.340816</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>50%</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.900000</td>\n",
+              "      <td>9.814000</td>\n",
+              "      <td>0.187000</td>\n",
+              "      <td>0.201000</td>\n",
+              "      <td>0.140335</td>\n",
+              "      <td>6.018836</td>\n",
+              "      <td>470.650523</td>\n",
+              "      <td>2.614468</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>75%</th>\n",
+              "      <td>2.007500e+06</td>\n",
+              "      <td>13.700000</td>\n",
+              "      <td>19.156750</td>\n",
+              "      <td>0.283000</td>\n",
+              "      <td>0.375000</td>\n",
+              "      <td>0.187092</td>\n",
+              "      <td>10.918176</td>\n",
+              "      <td>636.010802</td>\n",
+              "      <td>3.005449</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>max</th>\n",
+              "      <td>2.009999e+06</td>\n",
+              "      <td>20.700000</td>\n",
+              "      <td>939.400000</td>\n",
+              "      <td>1.000000</td>\n",
+              "      <td>39.297000</td>\n",
+              "      <td>0.889831</td>\n",
+              "      <td>68.018875</td>\n",
+              "      <td>4241.524913</td>\n",
+              "      <td>24.667968</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-238f6456-dbfb-4707-a725-c607c847f522')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-238f6456-dbfb-4707-a725-c607c847f522 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-238f6456-dbfb-4707-a725-c607c847f522');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 21
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check whether all the columns are needed for model training. Let's first have a look at the column descriptions provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the apparent magnitude the object would have if it were located 1 AU from both the Sun and the observer, at zero phase angle.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km.\n",
+        "* **albedo:** the fraction of incoming solar radiation that the object diffusely reflects, measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in object diameter, in km.\n",
+        "* **eccentricity:** value between 0 and 1 that refers to how flat or round the shape of the asteroid's orbit is.\n",
+        "* **inclination:** angle of the orbit with respect to the x-y ecliptic plane.\n",
+        "* **moid_ld:** Earth Minimum Orbit Intersection Distance, in lunar distance (LD) units.\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** half the length of the long axis of the orbit, in AU.\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. Because they only identify the row, they are not needed for model training and can be removed."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 29,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's have a look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 30,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 358
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "14a4ac64-5b54-4ed4-959d-daea65bb6457"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_868f8ad001ab00c7013b65472a513917\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_868f8ad001ab00c7013b65472a513917\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_868f8ad001ab00c7013b65472a513917\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 30
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most of the columns have no missing values. However, the columns **'diameter'**, **'albedo'** and **'diameter_sigma'** are each missing roughly 13% to 14% of their values. Since these values cannot be measured or derived, and the columns are not required for training the machine learning model, we can drop them."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 31,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "3be686d0-f56a-4054-a71a-d3019bf379e8"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_f88b77f183371d1a45fa87bed4a545f6\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_f88b77f183371d1a45fa87bed4a545f6\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_f88b77f183371d1a45fa87bed4a545f6\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-4e05872f-1261-4bd8-9251-a967afaa1b32\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-4e05872f-1261-4bd8-9251-a967afaa1b32')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-4e05872f-1261-4bd8-9251-a967afaa1b32 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-4e05872f-1261-4bd8-9251-a967afaa1b32');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 31
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "Next, we need to normalize the numerical columns before using them to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation (a.k.a. [z-score](https://developers.google.com/machine-learning/data-prep/transform/normalization#z-score)). This typically improves the model's performance and training stability.\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical columns and the categorical columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 32,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
+        "Now standardize the numerical columns using the z-score."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 33,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 587
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "16fede03-f67e-4c26-8714-fd3fc6892109"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_55302fa5950ce6ceb9f99ff9a168097a\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_55302fa5950ce6ceb9f99ff9a168097a\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_55302fa5950ce6ceb9f99ff9a168097a\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f727aadc-8e99-4a1b-b999-93e65a3d6d02\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>306</th>\n",
+              "      <td>-1.570727</td>\n",
+              "      <td>-0.062543</td>\n",
+              "      <td>-0.278518</td>\n",
+              "      <td>0.373194</td>\n",
+              "      <td>0.357201</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>310</th>\n",
+              "      <td>-1.631718</td>\n",
+              "      <td>-1.724526</td>\n",
+              "      <td>-0.736389</td>\n",
+              "      <td>1.087833</td>\n",
+              "      <td>0.344233</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>546</th>\n",
+              "      <td>-1.753698</td>\n",
+              "      <td>1.028793</td>\n",
+              "      <td>1.415303</td>\n",
+              "      <td>-0.339489</td>\n",
+              "      <td>0.139080</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>635</th>\n",
+              "      <td>-1.875678</td>\n",
+              "      <td>0.244869</td>\n",
+              "      <td>0.005905</td>\n",
+              "      <td>0.214107</td>\n",
+              "      <td>0.367559</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>701</th>\n",
+              "      <td>-3.278451</td>\n",
+              "      <td>-1.570523</td>\n",
+              "      <td>2.006145</td>\n",
+              "      <td>1.542754</td>\n",
+              "      <td>0.829337</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9697</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-1.151809</td>\n",
+              "      <td>-0.082944</td>\n",
+              "      <td>-0.129556</td>\n",
+              "      <td>-0.533538</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9813</th>\n",
+              "      <td>1.722740</td>\n",
+              "      <td>0.844551</td>\n",
+              "      <td>-0.583247</td>\n",
+              "      <td>-1.006447</td>\n",
+              "      <td>-0.677961</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9868</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-0.207399</td>\n",
+              "      <td>-0.784665</td>\n",
+              "      <td>-0.462136</td>\n",
+              "      <td>-0.539794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9903</th>\n",
+              "      <td>0.868878</td>\n",
+              "      <td>0.460086</td>\n",
+              "      <td>0.092258</td>\n",
+              "      <td>-0.107597</td>\n",
+              "      <td>0.071794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9956</th>\n",
+              "      <td>0.746898</td>\n",
+              "      <td>-0.234132</td>\n",
+              "      <td>-0.161116</td>\n",
+              "      <td>-0.601379</td>\n",
+              "      <td>-0.664887</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 5 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f727aadc-8e99-4a1b-b999-93e65a3d6d02')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f727aadc-8e99-4a1b-b999-93e65a3d6d02 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f727aadc-8e99-4a1b-b999-93e65a3d6d02');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 33
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_df_numericals = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize the DataFrame, using only the numerical columns\n",
+        "beam_df_numericals = (beam_df_numericals - beam_df_numericals.mean())/beam_df_numericals.std()\n",
+        "\n",
+        "ib.collect(beam_df_numericals)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to convert the categorical columns into one-hot encoding variables to use them during training. \n"

Review Comment:
   one-hot encod*ing* variables -> one-hot encod*ed* variables (?)
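
For reference, a minimal sketch of what "one-hot encoded variables" means, written in plain Python rather than with the Beam DataFrame API. The column name `object_class` is taken from the notebook's dataset; the category values below are made up for illustration.

```python
# Hypothetical category values; only the column name comes from the notebook.
values = ["MBA", "OMB", "MBA", "APO"]

# One indicator column per distinct category, in sorted order.
categories = sorted(set(values))

# Each row gets a 1 in the column matching its category and 0 elsewhere.
encoded = [
    {f"object_class_{cat}": int(value == cat) for cat in categories}
    for value in values
]

for row in encoded:
    print(row)
```

In pandas, `pandas.get_dummies` produces equivalent indicator columns.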



##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,2163 @@
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "source": [
+        "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License."
+      ],
+      "metadata": {
+        "id": "sARMhsXz8yR1",
+        "cellView": "form"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like\n",
+        "API to declare and define Beam processing pipelines. They provide a familiar interface for machine learning practitioners to build complex data-processing pipelines by only invoking standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "As we want to explore the elements within a `PCollection`, we can make use of the Interactive runner by installing Apache Beam with the `interactive` component. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to test our code interactively, building out the pipeline as we go before deploying it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we will only be working with a subset of the original dataset since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam method returns a deferred Beam DataFrame while pandas returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 5,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam import dataframe\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, full] where\n",
+        "# full contains all of the dataset (around 1000000 samples)\n",
+        "\n",
+        "source_csv_file = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(source_csv_file, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. A distance of 70,000 km might not seem potentially harmful, but at an astronomical scale it is very small and can disrupt many natural phenomena. \n",
+        "\n",
+        "These objects/asteroids can thus prove to be harmful, so it is wise to know what surrounds us and which of those objects can harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types."
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 3,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "f5386993-14cb-42ee-94ca-8ea006860d3e"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 3
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a Pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 8,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 378
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "e49b4243-107f-4256-9e09-49cc20bf7f56"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_f6af66571c53daa0d9052370b7d1d8b7\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_f6af66571c53daa0d9052370b7d1d8b7\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_f6af66571c53daa0d9052370b7d1d8b7\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "             spk_id  absolute_magnitude     diameter       albedo  \\\n",
+              "count  9.999000e+03         9999.000000  8688.000000  8672.000000   \n",
+              "mean   2.005000e+06           12.675380    19.245446     0.197723   \n",
+              "std    2.886607e+03            1.639609    30.190191     0.138819   \n",
+              "min    2.000001e+06            3.000000     0.300000     0.008000   \n",
+              "25%    2.002500e+06           11.900000     5.614000     0.074000   \n",
+              "50%    2.005000e+06           12.900000     9.814000     0.187000   \n",
+              "75%    2.007500e+06           13.700000    19.156750     0.283000   \n",
+              "max    2.009999e+06           20.700000   939.400000     1.000000   \n",
+              "\n",
+              "       diameter_sigma  eccentricity  inclination      moid_ld  \\\n",
+              "count     8591.000000   9999.000000  9999.000000  9999.000000   \n",
+              "mean         0.454072      0.148716     7.890742   509.805237   \n",
+              "std          1.093676      0.083803     6.336244   205.046582   \n",
+              "min          0.006000      0.001003     0.042716     0.131028   \n",
+              "25%          0.120000      0.093780     3.220137   377.829197   \n",
+              "50%          0.201000      0.140335     6.018836   470.650523   \n",
+              "75%          0.375000      0.187092    10.918176   636.010802   \n",
+              "max         39.297000      0.889831    68.018875  4241.524913   \n",
+              "\n",
+              "       semi_major_axis_au_unit  \n",
+              "count              9999.000000  \n",
+              "mean                  2.689836  \n",
+              "std                   0.607190  \n",
+              "min                   0.832048  \n",
+              "25%                   2.340816  \n",
+              "50%                   2.614468  \n",
+              "75%                   3.005449  \n",
+              "max                  24.667968  "
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-2c0349a9-81c4-473a-9fa1-44c423244858\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>count</th>\n",
+              "      <td>9.999000e+03</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>8688.000000</td>\n",
+              "      <td>8672.000000</td>\n",
+              "      <td>8591.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>mean</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.675380</td>\n",
+              "      <td>19.245446</td>\n",
+              "      <td>0.197723</td>\n",
+              "      <td>0.454072</td>\n",
+              "      <td>0.148716</td>\n",
+              "      <td>7.890742</td>\n",
+              "      <td>509.805237</td>\n",
+              "      <td>2.689836</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>std</th>\n",
+              "      <td>2.886607e+03</td>\n",
+              "      <td>1.639609</td>\n",
+              "      <td>30.190191</td>\n",
+              "      <td>0.138819</td>\n",
+              "      <td>1.093676</td>\n",
+              "      <td>0.083803</td>\n",
+              "      <td>6.336244</td>\n",
+              "      <td>205.046582</td>\n",
+              "      <td>0.607190</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>min</th>\n",
+              "      <td>2.000001e+06</td>\n",
+              "      <td>3.000000</td>\n",
+              "      <td>0.300000</td>\n",
+              "      <td>0.008000</td>\n",
+              "      <td>0.006000</td>\n",
+              "      <td>0.001003</td>\n",
+              "      <td>0.042716</td>\n",
+              "      <td>0.131028</td>\n",
+              "      <td>0.832048</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>25%</th>\n",
+              "      <td>2.002500e+06</td>\n",
+              "      <td>11.900000</td>\n",
+              "      <td>5.614000</td>\n",
+              "      <td>0.074000</td>\n",
+              "      <td>0.120000</td>\n",
+              "      <td>0.093780</td>\n",
+              "      <td>3.220137</td>\n",
+              "      <td>377.829197</td>\n",
+              "      <td>2.340816</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>50%</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.900000</td>\n",
+              "      <td>9.814000</td>\n",
+              "      <td>0.187000</td>\n",
+              "      <td>0.201000</td>\n",
+              "      <td>0.140335</td>\n",
+              "      <td>6.018836</td>\n",
+              "      <td>470.650523</td>\n",
+              "      <td>2.614468</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>75%</th>\n",
+              "      <td>2.007500e+06</td>\n",
+              "      <td>13.700000</td>\n",
+              "      <td>19.156750</td>\n",
+              "      <td>0.283000</td>\n",
+              "      <td>0.375000</td>\n",
+              "      <td>0.187092</td>\n",
+              "      <td>10.918176</td>\n",
+              "      <td>636.010802</td>\n",
+              "      <td>3.005449</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>max</th>\n",
+              "      <td>2.009999e+06</td>\n",
+              "      <td>20.700000</td>\n",
+              "      <td>939.400000</td>\n",
+              "      <td>1.000000</td>\n",
+              "      <td>39.297000</td>\n",
+              "      <td>0.889831</td>\n",
+              "      <td>68.018875</td>\n",
+              "      <td>4241.524913</td>\n",
+              "      <td>24.667968</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-2c0349a9-81c4-473a-9fa1-44c423244858')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-2c0349a9-81c4-473a-9fa1-44c423244858 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-2c0349a9-81c4-473a-9fa1-44c423244858');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://developers.google.com/machine-learning/data-prep/transform/normalization) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://developers.google.com/machine-learning/data-prep/transform/transform-categorical) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can also use the standard pandas command `DataFrame.describe()` to generate descriptive statistics for the numerical columns, such as percentiles, mean, and standard deviation. "
+      ],
+      "metadata": {
+        "id": "MGAErO0lAYws"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "with dataframe.allow_non_parallel_operations():\n",
+        "  beam_df_description = ib.collect(beam_df.describe())\n",
+        "\n",
+        "beam_df_description"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 378
+        },
+        "id": "Befv697VBGM7",
+        "outputId": "d02b7a41-a8a3-4837-cf63-e1fa9e7b011e"
+      },
+      "execution_count": 9,
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a5b31481d153dff1b7ecdd673624949b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a5b31481d153dff1b7ecdd673624949b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a5b31481d153dff1b7ecdd673624949b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "             spk_id  absolute_magnitude     diameter       albedo  \\\n",
+              "count  9.999000e+03         9999.000000  8688.000000  8672.000000   \n",
+              "mean   2.005000e+06           12.675380    19.245446     0.197723   \n",
+              "std    2.886607e+03            1.639609    30.190191     0.138819   \n",
+              "min    2.000001e+06            3.000000     0.300000     0.008000   \n",
+              "25%    2.002500e+06           11.900000     5.614000     0.074000   \n",
+              "50%    2.005000e+06           12.900000     9.814000     0.187000   \n",
+              "75%    2.007500e+06           13.700000    19.156750     0.283000   \n",
+              "max    2.009999e+06           20.700000   939.400000     1.000000   \n",
+              "\n",
+              "       diameter_sigma  eccentricity  inclination      moid_ld  \\\n",
+              "count     8591.000000   9999.000000  9999.000000  9999.000000   \n",
+              "mean         0.454072      0.148716     7.890742   509.805237   \n",
+              "std          1.093676      0.083803     6.336244   205.046582   \n",
+              "min          0.006000      0.001003     0.042716     0.131028   \n",
+              "25%          0.120000      0.093780     3.220137   377.829197   \n",
+              "50%          0.201000      0.140335     6.018836   470.650523   \n",
+              "75%          0.375000      0.187092    10.918176   636.010802   \n",
+              "max         39.297000      0.889831    68.018875  4241.524913   \n",
+              "\n",
+              "       semi_major_axis_au_unit  \n",
+              "count              9999.000000  \n",
+              "mean                  2.689836  \n",
+              "std                   0.607190  \n",
+              "min                   0.832048  \n",
+              "25%                   2.340816  \n",
+              "50%                   2.614468  \n",
+              "75%                   3.005449  \n",
+              "max                  24.667968  "
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-d16cf806-a3e2-46d9-973d-74448570aaa2\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>count</th>\n",
+              "      <td>9.999000e+03</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>8688.000000</td>\n",
+              "      <td>8672.000000</td>\n",
+              "      <td>8591.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>mean</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.675380</td>\n",
+              "      <td>19.245446</td>\n",
+              "      <td>0.197723</td>\n",
+              "      <td>0.454072</td>\n",
+              "      <td>0.148716</td>\n",
+              "      <td>7.890742</td>\n",
+              "      <td>509.805237</td>\n",
+              "      <td>2.689836</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>std</th>\n",
+              "      <td>2.886607e+03</td>\n",
+              "      <td>1.639609</td>\n",
+              "      <td>30.190191</td>\n",
+              "      <td>0.138819</td>\n",
+              "      <td>1.093676</td>\n",
+              "      <td>0.083803</td>\n",
+              "      <td>6.336244</td>\n",
+              "      <td>205.046582</td>\n",
+              "      <td>0.607190</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>min</th>\n",
+              "      <td>2.000001e+06</td>\n",
+              "      <td>3.000000</td>\n",
+              "      <td>0.300000</td>\n",
+              "      <td>0.008000</td>\n",
+              "      <td>0.006000</td>\n",
+              "      <td>0.001003</td>\n",
+              "      <td>0.042716</td>\n",
+              "      <td>0.131028</td>\n",
+              "      <td>0.832048</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>25%</th>\n",
+              "      <td>2.002500e+06</td>\n",
+              "      <td>11.900000</td>\n",
+              "      <td>5.614000</td>\n",
+              "      <td>0.074000</td>\n",
+              "      <td>0.120000</td>\n",
+              "      <td>0.093780</td>\n",
+              "      <td>3.220137</td>\n",
+              "      <td>377.829197</td>\n",
+              "      <td>2.340816</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>50%</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.900000</td>\n",
+              "      <td>9.814000</td>\n",
+              "      <td>0.187000</td>\n",
+              "      <td>0.201000</td>\n",
+              "      <td>0.140335</td>\n",
+              "      <td>6.018836</td>\n",
+              "      <td>470.650523</td>\n",
+              "      <td>2.614468</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>75%</th>\n",
+              "      <td>2.007500e+06</td>\n",
+              "      <td>13.700000</td>\n",
+              "      <td>19.156750</td>\n",
+              "      <td>0.283000</td>\n",
+              "      <td>0.375000</td>\n",
+              "      <td>0.187092</td>\n",
+              "      <td>10.918176</td>\n",
+              "      <td>636.010802</td>\n",
+              "      <td>3.005449</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>max</th>\n",
+              "      <td>2.009999e+06</td>\n",
+              "      <td>20.700000</td>\n",
+              "      <td>939.400000</td>\n",
+              "      <td>1.000000</td>\n",
+              "      <td>39.297000</td>\n",
+              "      <td>0.889831</td>\n",
+              "      <td>68.018875</td>\n",
+              "      <td>4241.524913</td>\n",
+              "      <td>24.667968</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-d16cf806-a3e2-46d9-973d-74448570aaa2')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-d16cf806-a3e2-46d9-973d-74448570aaa2 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-d16cf806-a3e2-46d9-973d-74448570aaa2');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 9
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before applying any transformations, we need to check whether every column is needed for model training. Let's first look at the column descriptions provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the visual magnitude the asteroid would have if it were observed at a distance of 1 au from both the Sun and the observer, at zero phase angle.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km.\n",
+        "* **albedo:** the fraction of incoming solar radiation that the object reflects diffusely, measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in object diameter, in km.\n",
+        "* **eccentricity:** value between 0 and 1 that describes how far the asteroid's orbit deviates from a circle (0 is a perfect circle).\n",
+        "* **inclination:** angle of the orbit with respect to the x-y ecliptic plane.\n",
+        "* **moid_ld:** Earth Minimum Orbit Intersection Distance, in lunar distances (LD), as the column name indicates.\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** half the length of the long axis of the orbit, in au.\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. They can be removed because such identifiers carry no information that is useful for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's have a look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_02cba1067de8e024374a26297834d233\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most of the columns have no missing values. However, columns **'diameter'**, **'albedo'** and **'diameter_sigma'** are each missing in roughly 13-14% of the rows. Since these values cannot be measured or derived, we drop these three columns; they will not be used for training the machine learning model."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "Next, we need to normalize the numerical columns before using them to train a model. A common method of standarization is to subtract the mean and divide by standard deviation (a.k.a [z-score](https://developers.google.com/machine-learning/data-prep/transform/normalization#z-score)). This improves the performance and trainign stability of the model during training and inference.\n"

Review Comment:
   It looks like this typo is still here :)
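
For reference, the z-score normalization described in the cell above (subtract the mean, divide by the standard deviation) can be sketched in plain Python. This is only an illustration, not the notebook's code: the helper name `z_score` is hypothetical, and the input values are the four `eccentricity` entries visible in the sample output quoted above. It uses the sample standard deviation, which is also what pandas' `std()` computes by default.

```python
import statistics


def z_score(values):
    """Standardize numbers: subtract the mean, divide by the sample std."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample std (ddof=1), like pandas' default
    return [(v - mean) / std for v in values]


# Eccentricity values copied from the sample rows in the quoted output.
eccentricity = [0.235174, 0.113059, 0.093852, 0.071351]
normalized = z_score(eccentricity)
```

After this transformation the column has a mean of approximately 0 and a sample standard deviation of approximately 1.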





[GitHub] [beam] TheNeuralBit commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r1012105546


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,3517 @@
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "source": [
+        "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License."
+      ],
+      "metadata": {
+        "id": "sARMhsXz8yR1",
+        "cellView": "form"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like\n",
+        "API to declare and define Beam processing pipelines. The API provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by only invoking standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Goal\n",
+        "The goal of this notebook is to explore a dataset and preprocess it for machine learning model training using the Beam DataFrames API.\n",
+        "\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "As we want to explore the elements within a `PCollection`, we can make use of the Interactive runner by installing Apache Beam with the `interactive` component. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [

Review Comment:
   It looks like, in theory, we could stick this in a custom [metadata field](https://nbformat.readthedocs.io/en/latest/format_description.html#metadata), something like:
   ```
         "metadata": {
           "id": "-OJC0Xn5Um-C",
           "beam:comment": "TODO(https://github.com/apache/beam/XXXX): Just install 2.43.0 once it's released"
         },
   ```
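
A minimal sketch of how such a field could be attached programmatically, using only the standard `json` module. The helper `annotate_cell` and the stripped-down notebook dict are hypothetical and exist only for illustration; the `beam:comment` key and the cell id come from the suggestion above, and the `XXXX` placeholder is kept as written.

```python
import json

# A stripped-down notebook with a single code cell (cell source omitted).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {"id": "-OJC0Xn5Um-C"},
            "outputs": [],
            "source": [],
        }
    ],
}


def annotate_cell(nb, cell_id, key, comment):
    """Store `comment` under `key` in the metadata of the cell with `cell_id`.

    Returns True if a matching cell was found, False otherwise.
    """
    for cell in nb["cells"]:
        if cell.get("metadata", {}).get("id") == cell_id:
            cell["metadata"][key] = comment
            return True
    return False


found = annotate_cell(
    notebook,
    "-OJC0Xn5Um-C",
    "beam:comment",
    "TODO(https://github.com/apache/beam/XXXX): Just install 2.43.0 once it's released",
)

# The note survives a JSON round trip and stays out of the rendered notebook.
serialized = json.dumps(notebook, indent=2)
```

Because the note lives in cell metadata rather than in the cell source, it is not shown to readers of the rendered notebook, although some editors may rewrite or drop unknown metadata keys when saving.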





[GitHub] [beam] TheNeuralBit commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r959801581


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. The API provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by only invoking standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use the [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to quickly test our pipeline locally before running it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we will only be working with a subset of the original dataset since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam method returns a deferred Beam DataFrame while the pandas function returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, sample] where\n",
+        "# sample contains all of the dataset (around 1000000 samples)\n",
+        "file_location = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize the pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(file_location, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. A distance of 70,000 km might not seem close enough to harm us, but on an astronomical scale it is very small, and an object at that distance can disrupt many natural phenomena. \n",
+        "\n",
+        "These objects/asteroids can therefore prove harmful, so it is wise to know what surrounds us and which of these objects could harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types"
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "c24ff83d-3a13-47a6-c9c2-3978729fde82"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 4
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a Pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 746
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "14fa80de-2dee-4963-99d8-3e321f949ff8"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a986c6cc61ed5a5b622e163d92f73775\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "       spk_id                   full_name near_earth_object  \\\n",
+              "0     2000001                     1 Ceres                 N   \n",
+              "1     2000002                    2 Pallas                 N   \n",
+              "2     2000003                      3 Juno                 N   \n",
+              "3     2000004                     4 Vesta                 N   \n",
+              "4     2000005                   5 Astraea                 N   \n",
+              "...       ...                         ...               ...   \n",
+              "9994  2009995    9995 Alouette (4805 P-L)                 N   \n",
+              "9995  2009996         9996 ANS (9070 P-L)                 N   \n",
+              "9996  2009997        9997 COBE (1217 T-1)                 N   \n",
+              "9997  2009998         9998 ISO (1293 T-1)                 N   \n",
+              "9998  2009999       9999 Wiles (4196 T-2)                 N   \n",
+              "\n",
+              "      absolute_magnitude  diameter  albedo  diameter_sigma  eccentricity  \\\n",
+              "0                   3.40   939.400  0.0900           0.200      0.076009   \n",
+              "1                   4.20   545.000  0.1010          18.000      0.229972   \n",
+              "2                   5.33   246.596  0.2140          10.594      0.256936   \n",
+              "3                   3.00   525.400  0.4228           0.200      0.088721   \n",
+              "4                   6.90   106.699  0.2740           3.140      0.190913   \n",
+              "...                  ...       ...     ...             ...           ...   \n",
+              "9994               15.10     2.564  0.2450           0.550      0.160610   \n",
+              "9995               13.60     8.978  0.1130           0.376      0.235174   \n",
+              "9996               14.30       NaN     NaN             NaN      0.113059   \n",
+              "9997               15.10     2.235  0.3880           0.373      0.093852   \n",
+              "9998               13.00     7.148  0.2620           0.065      0.071351   \n",
+              "\n",
+              "      inclination     moid_ld object_class  semi_major_axis_au_unit  \\\n",
+              "0       10.594067  620.640533          MBA                 2.769165   \n",
+              "1       34.832932  480.348639          MBA                 2.773841   \n",
+              "2       12.991043  402.514639          MBA                 2.668285   \n",
+              "3        7.141771  443.451432          MBA                 2.361418   \n",
+              "4        5.367427  426.433027          MBA                 2.574037   \n",
+              "...           ...         ...          ...                      ...   \n",
+              "9994     2.311731  388.723233          MBA                 2.390249   \n",
+              "9995     7.657713  444.194746          MBA                 2.796605   \n",
+              "9996     2.459643  495.460110          MBA                 2.545674   \n",
+              "9997     3.912263  373.848377          MBA                 2.160961   \n",
+              "9998     3.198839  632.144398          MBA                 2.839917   \n",
+              "\n",
+              "     hazardous_flag  \n",
+              "0                 N  \n",
+              "1                 N  \n",
+              "2                 N  \n",
+              "3                 N  \n",
+              "4                 N  \n",
+              "...             ...  \n",
+              "9994              N  \n",
+              "9995              N  \n",
+              "9996              N  \n",
+              "9997              N  \n",
+              "9998              N  \n",
+              "\n",
+              "[9999 rows x 13 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-0161aa69-d50f-4d6f-84c1-10dacb278880\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>full_name</th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>2000001</td>\n",
+              "      <td>1 Ceres</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>939.400</td>\n",
+              "      <td>0.0900</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>2000002</td>\n",
+              "      <td>2 Pallas</td>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>545.000</td>\n",
+              "      <td>0.1010</td>\n",
+              "      <td>18.000</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>2000003</td>\n",
+              "      <td>3 Juno</td>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>246.596</td>\n",
+              "      <td>0.2140</td>\n",
+              "      <td>10.594</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>2000004</td>\n",
+              "      <td>4 Vesta</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>525.400</td>\n",
+              "      <td>0.4228</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>2000005</td>\n",
+              "      <td>5 Astraea</td>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>106.699</td>\n",
+              "      <td>0.2740</td>\n",
+              "      <td>3.140</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>2009995</td>\n",
+              "      <td>9995 Alouette (4805 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.564</td>\n",
+              "      <td>0.2450</td>\n",
+              "      <td>0.550</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>2009996</td>\n",
+              "      <td>9996 ANS (9070 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>8.978</td>\n",
+              "      <td>0.1130</td>\n",
+              "      <td>0.376</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>2009997</td>\n",
+              "      <td>9997 COBE (1217 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>2009998</td>\n",
+              "      <td>9998 ISO (1293 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.235</td>\n",
+              "      <td>0.3880</td>\n",
+              "      <td>0.373</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>2009999</td>\n",
+              "      <td>9999 Wiles (4196 T-2)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>7.148</td>\n",
+              "      <td>0.2620</td>\n",
+              "      <td>0.065</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 13 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-0161aa69-d50f-4d6f-84c1-10dacb278880')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 5
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
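+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "As a rough illustration of what one-hot encoding produces, the next cell is a minimal sketch that uses plain pandas on a small made-up frame. It does not use the Beam DataFrame API, and `toy_df`, `toy_encoded` and the sample values are assumptions chosen only for this example."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "import pandas as pd\n",
+        "\n",
+        "# Hypothetical toy data (not the asteroid dataset): two categorical columns.\n",
+        "toy_df = pd.DataFrame({\n",
+        "    'object_class': ['MBA', 'APO', 'MBA'],\n",
+        "    'hazardous_flag': ['N', 'Y', 'N'],\n",
+        "})\n",
+        "\n",
+        "# Each distinct category value becomes its own indicator column,\n",
+        "# for example 'object_class_MBA' and 'object_class_APO'.\n",
+        "toy_encoded = pd.get_dummies(toy_df, columns=['object_class', 'hazardous_flag'])\n",
+        "toy_encoded"
+      ]
+    },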
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check whether all the columns can be used for model training. Let's first look at the column descriptions provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the apparent magnitude the object would have if it were located 1 au from both the Sun and the observer, at zero phase angle.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km\n",
+        "* **albedo:** the fraction of incoming solar radiation that is diffusely reflected, measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in the object diameter, in km\n",
+        "* **eccentricity:** a value between 0 and 1 that describes how elongated the asteroid's orbit is (0 is a perfect circle)\n",
+        "* **inclination:** angle with respect to x-y ecliptic plane\n",
+        "* **moid_ld:** Earth minimum orbit intersection distance (MOID), in lunar distances (LD)\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** half the length of the long axis of the orbit, in au\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. These columns can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_02cba1067de8e024374a26297834d233\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
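+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "To see why `isnull().mean() * 100` yields a percentage, here is a minimal sketch in plain pandas on a made-up frame (`toy_missing` and its values are assumptions for illustration only). `isnull()` returns booleans, and the mean of a boolean column is the fraction of `True` entries, so multiplying by 100 gives the percentage of missing values per column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "import numpy as np\n",
+        "import pandas as pd\n",
+        "\n",
+        "# Hypothetical toy data: 'diameter' is missing in 2 of 4 rows, 'albedo' in none.\n",
+        "toy_missing = pd.DataFrame({\n",
+        "    'diameter': [1.0, np.nan, 3.0, np.nan],\n",
+        "    'albedo': [0.1, 0.2, 0.3, 0.4],\n",
+        "})\n",
+        "\n",
+        "# Fraction of missing entries per column, expressed as a percentage.\n",
+        "toy_missing.isnull().mean() * 100"
+      ]
+    },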
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most columns have no missing values. Columns **'diameter'**, **'albedo'** and **'diameter_sigma'** are each missing in roughly 13-14% of the rows. Since the missing values cannot be measured or derived, we can remove these columns; they will not be required for training the machine learning model."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "The numerical columns need to be normalized before using them to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation. This puts all numerical columns on a comparable scale so that they are weighted equally during training."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical columns and the categorical columns"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 356
+        },
+        "id": "Gjc0UlDD-xUn",
+        "outputId": "cadc4402-7edc-43f6-bc6a-bc4f1fd3314a"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "error",
+          "ename": "NotImplementedError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mNotImplementedError\u001b[0m                       Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/987840581.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Normalizing method_1: Can work but relies on ticket\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001
 b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frame_base.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    426\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    427\u001b[0m     raise NotImplementedError(\n\u001b[0;32m--> 428\u001b[0;31m         \u001b[0;34mf\"{op!r} is not implemented yet. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    429\u001b[0m         \u001b[0;34mf\"If support for {op!r} is important to you, please let the Beam \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    430\u001b[0m         \u001b[0;34m\"community know by writing to user@beam.apache.org \"\u001b[0m\u001b[0;34m\u
 001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mNotImplementedError\u001b[0m: 'loc.setitem' is not implemented yet. If support for 'loc.setitem' is important to you, please let the Beam community know by writing to user@beam.apache.org (see https://beam.apache.org/community/contact-us/) or commenting on https://github.com/apache/beam/issues/20318"
+          ]
+        }
+      ],
+      "source": [
+        "# Normalizing method_1: Can work but relies on ticket #22267\n",
+        "beam_df.loc[:,numerical_cols] = (beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean())/beam_df.loc[:, numerical_cols].std()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
+        "Normalizing the data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "be40308f-d27c-46fa-e365-43a635addf6b"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>306</th>\n",
+              "      <td>-1.570727</td>\n",
+              "      <td>-0.062543</td>\n",
+              "      <td>-0.278518</td>\n",
+              "      <td>0.373194</td>\n",
+              "      <td>0.357201</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>310</th>\n",
+              "      <td>-1.631718</td>\n",
+              "      <td>-1.724526</td>\n",
+              "      <td>-0.736389</td>\n",
+              "      <td>1.087833</td>\n",
+              "      <td>0.344233</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>546</th>\n",
+              "      <td>-1.753698</td>\n",
+              "      <td>1.028793</td>\n",
+              "      <td>1.415303</td>\n",
+              "      <td>-0.339489</td>\n",
+              "      <td>0.139080</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>635</th>\n",
+              "      <td>-1.875678</td>\n",
+              "      <td>0.244869</td>\n",
+              "      <td>0.005905</td>\n",
+              "      <td>0.214107</td>\n",
+              "      <td>0.367559</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>701</th>\n",
+              "      <td>-3.278451</td>\n",
+              "      <td>-1.570523</td>\n",
+              "      <td>2.006145</td>\n",
+              "      <td>1.542754</td>\n",
+              "      <td>0.829337</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9697</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-1.151809</td>\n",
+              "      <td>-0.082944</td>\n",
+              "      <td>-0.129556</td>\n",
+              "      <td>-0.533538</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9813</th>\n",
+              "      <td>1.722740</td>\n",
+              "      <td>0.844551</td>\n",
+              "      <td>-0.583247</td>\n",
+              "      <td>-1.006447</td>\n",
+              "      <td>-0.677961</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9868</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-0.207399</td>\n",
+              "      <td>-0.784665</td>\n",
+              "      <td>-0.462136</td>\n",
+              "      <td>-0.539794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9903</th>\n",
+              "      <td>0.868878</td>\n",
+              "      <td>0.460086</td>\n",
+              "      <td>0.092258</td>\n",
+              "      <td>-0.107597</td>\n",
+              "      <td>0.071794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9956</th>\n",
+              "      <td>0.746898</td>\n",
+              "      <td>-0.234132</td>\n",
+              "      <td>-0.161116</td>\n",
+              "      <td>-0.601379</td>\n",
+              "      <td>-0.664887</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 5 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_numerical_cols = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize the DataFrame that contains only the numerical columns\n",
+        "beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean())/beam_numerical_cols.std()\n",
+        "\n",
+        "ib.collect(beam_numerical_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to convert the categorical columns into one-hot encoded variables so that they can be used during training.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "Ngoxg0rSywVd",
+        "outputId": "d81bb29a-f8f8-4186-b186-5ff85667dbec"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1671644751.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'object_class'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'get_dummies'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col= beam_df.filter(items=['object_class'])\n",
+        "object_class_col.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "hz8s7z8caTq-",
+        "outputId": "5c543a6b-0ea1-41f8-afec-9691bbbd1f5b"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1927971370.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;31m# df['categories_concat'].str.get_dummies('-')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'str'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col.str.get_dummies()"

Review Comment:
   To be clear, I'm suggesting you change `object_class_col= beam_df.filter(items=['object_class'])` to `object_class_col= beam_df['object_class']`
   
   When you select the column with `filter`, it creates a DataFrame with a single column, not a Series.
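
The difference described in this comment can be reproduced in plain pandas. The sketch below is only an illustration under stated assumptions: it uses an ordinary in-memory pandas DataFrame rather than Beam's deferred types, and the sample values (`APO`, `AMO`) are hypothetical; only the column name `object_class` comes from the notebook. It does not show whether the deferred Beam Series supports the same call without further changes.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's data; only the column name is from the PR.
df = pd.DataFrame({"object_class": ["APO", "AMO", "APO"]})

# filter(items=[...]) keeps the frame shape: a DataFrame with a single column.
as_frame = df.filter(items=["object_class"])

# Bracket selection with a single label returns a Series.
as_series = df["object_class"]

print(type(as_frame).__name__)   # DataFrame
print(type(as_series).__name__)  # Series

# The .str accessor is defined on Series, not on DataFrame.
print(hasattr(as_frame, "str"))  # False

# With a Series, the string accessor and its get_dummies method are reachable.
dummies = as_series.str.get_dummies()
print(list(dummies.columns))
```

In plain pandas the single-column DataFrame has no `.str` accessor, which matches the `AttributeError` shown in the notebook output for the deferred DataFrame.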





[GitHub] [beam] TheNeuralBit commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r973221740


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. They provide a familiar interface for machine learning practitioners to build complex data-processing pipelines by only invoking standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to quickly test our pipeline locally before running it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we will only be working with a subset of the original dataset since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function, which emulates `pandas.read_csv`. The main difference between them is that the Beam function returns a deferred Beam DataFrame, while the pandas function returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, sample] where\n",
+        "# sample contains all of the dataset (around 1000000 samples)\n",
+        "file_location = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize the pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(file_location, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. A distance of 70,000 km might not seem potentially harmful, but at an astronomical scale it is very small and can disrupt many natural phenomena.\n",
+        "\n",
+        "These objects (asteroids) can thus prove to be harmful, so it is wise to know what surrounds us and which of these objects could harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types."
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "c24ff83d-3a13-47a6-c9c2-3978729fde82"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 4
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a Pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 746
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "14fa80de-2dee-4963-99d8-3e321f949ff8"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a986c6cc61ed5a5b622e163d92f73775\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "       spk_id                   full_name near_earth_object  \\\n",
+              "0     2000001                     1 Ceres                 N   \n",
+              "1     2000002                    2 Pallas                 N   \n",
+              "2     2000003                      3 Juno                 N   \n",
+              "3     2000004                     4 Vesta                 N   \n",
+              "4     2000005                   5 Astraea                 N   \n",
+              "...       ...                         ...               ...   \n",
+              "9994  2009995    9995 Alouette (4805 P-L)                 N   \n",
+              "9995  2009996         9996 ANS (9070 P-L)                 N   \n",
+              "9996  2009997        9997 COBE (1217 T-1)                 N   \n",
+              "9997  2009998         9998 ISO (1293 T-1)                 N   \n",
+              "9998  2009999       9999 Wiles (4196 T-2)                 N   \n",
+              "\n",
+              "      absolute_magnitude  diameter  albedo  diameter_sigma  eccentricity  \\\n",
+              "0                   3.40   939.400  0.0900           0.200      0.076009   \n",
+              "1                   4.20   545.000  0.1010          18.000      0.229972   \n",
+              "2                   5.33   246.596  0.2140          10.594      0.256936   \n",
+              "3                   3.00   525.400  0.4228           0.200      0.088721   \n",
+              "4                   6.90   106.699  0.2740           3.140      0.190913   \n",
+              "...                  ...       ...     ...             ...           ...   \n",
+              "9994               15.10     2.564  0.2450           0.550      0.160610   \n",
+              "9995               13.60     8.978  0.1130           0.376      0.235174   \n",
+              "9996               14.30       NaN     NaN             NaN      0.113059   \n",
+              "9997               15.10     2.235  0.3880           0.373      0.093852   \n",
+              "9998               13.00     7.148  0.2620           0.065      0.071351   \n",
+              "\n",
+              "      inclination     moid_ld object_class  semi_major_axis_au_unit  \\\n",
+              "0       10.594067  620.640533          MBA                 2.769165   \n",
+              "1       34.832932  480.348639          MBA                 2.773841   \n",
+              "2       12.991043  402.514639          MBA                 2.668285   \n",
+              "3        7.141771  443.451432          MBA                 2.361418   \n",
+              "4        5.367427  426.433027          MBA                 2.574037   \n",
+              "...           ...         ...          ...                      ...   \n",
+              "9994     2.311731  388.723233          MBA                 2.390249   \n",
+              "9995     7.657713  444.194746          MBA                 2.796605   \n",
+              "9996     2.459643  495.460110          MBA                 2.545674   \n",
+              "9997     3.912263  373.848377          MBA                 2.160961   \n",
+              "9998     3.198839  632.144398          MBA                 2.839917   \n",
+              "\n",
+              "     hazardous_flag  \n",
+              "0                 N  \n",
+              "1                 N  \n",
+              "2                 N  \n",
+              "3                 N  \n",
+              "4                 N  \n",
+              "...             ...  \n",
+              "9994              N  \n",
+              "9995              N  \n",
+              "9996              N  \n",
+              "9997              N  \n",
+              "9998              N  \n",
+              "\n",
+              "[9999 rows x 13 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-0161aa69-d50f-4d6f-84c1-10dacb278880\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>full_name</th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>2000001</td>\n",
+              "      <td>1 Ceres</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>939.400</td>\n",
+              "      <td>0.0900</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>2000002</td>\n",
+              "      <td>2 Pallas</td>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>545.000</td>\n",
+              "      <td>0.1010</td>\n",
+              "      <td>18.000</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>2000003</td>\n",
+              "      <td>3 Juno</td>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>246.596</td>\n",
+              "      <td>0.2140</td>\n",
+              "      <td>10.594</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>2000004</td>\n",
+              "      <td>4 Vesta</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>525.400</td>\n",
+              "      <td>0.4228</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>2000005</td>\n",
+              "      <td>5 Astraea</td>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>106.699</td>\n",
+              "      <td>0.2740</td>\n",
+              "      <td>3.140</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>2009995</td>\n",
+              "      <td>9995 Alouette (4805 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.564</td>\n",
+              "      <td>0.2450</td>\n",
+              "      <td>0.550</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>2009996</td>\n",
+              "      <td>9996 ANS (9070 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>8.978</td>\n",
+              "      <td>0.1130</td>\n",
+              "      <td>0.376</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>2009997</td>\n",
+              "      <td>9997 COBE (1217 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>2009998</td>\n",
+              "      <td>9998 ISO (1293 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.235</td>\n",
+              "      <td>0.3880</td>\n",
+              "      <td>0.373</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>2009999</td>\n",
+              "      <td>9999 Wiles (4196 T-2)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>7.148</td>\n",
+              "      <td>0.2620</td>\n",
+              "      <td>0.065</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 13 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-0161aa69-d50f-4d6f-84c1-10dacb278880')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 5
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check whether all the columns can be used for model training. Let's first have a look at the column descriptions as provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the visual magnitude the asteroid would have if it were located 1 au from both the Sun and the observer, at zero phase angle.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km.\n",
+        "* **albedo:** the fraction of the total incoming solar radiation that is diffusely reflected, measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in the object diameter, in km.\n",
+        "* **eccentricity:** value between 0 and 1 that describes how circular or elongated the asteroid's orbit is.\n",
+        "* **inclination:** angle of the orbit with respect to the x-y ecliptic plane.\n",
+        "* **moid_ld:** Earth Minimum Orbit Intersection Distance, in lunar distance (LD) units.\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** half the length of the long axis of the orbit, in au.\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. These columns can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's have a look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_02cba1067de8e024374a26297834d233\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most columns have no missing values. The columns **'diameter'**, **'albedo'** and **'diameter_sigma'** are each missing in roughly 13-14% of the rows. Since these values cannot be measured or derived, we remove these columns; they are not required for training the machine learning model."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "The numerical columns need to be normalized before they are used to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation. This puts all numerical columns on the same scale, so that they are weighted equally during training."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical and the categorical columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 356
+        },
+        "id": "Gjc0UlDD-xUn",
+        "outputId": "cadc4402-7edc-43f6-bc6a-bc4f1fd3314a"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "error",
+          "ename": "NotImplementedError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mNotImplementedError\u001b[0m                       Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/987840581.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Normalizing method_1: Can work but relies on ticket\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frame_base.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    426\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    427\u001b[0m     raise NotImplementedError(\n\u001b[0;32m--> 428\u001b[0;31m         \u001b[0;34mf\"{op!r} is not implemented yet. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    429\u001b[0m         \u001b[0;34mf\"If support for {op!r} is important to you, please let the Beam \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    430\u001b[0m         \u001b[0;34m\"community know by writing to user@beam.apache.org \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mNotImplementedError\u001b[0m: 'loc.setitem' is not implemented yet. If support for 'loc.setitem' is important to you, please let the Beam community know by writing to user@beam.apache.org (see https://beam.apache.org/community/contact-us/) or commenting on https://github.com/apache/beam/issues/20318"
+          ]
+        }
+      ],
+      "source": [
+        "# Normalizing method_1: Can work but relies on ticket #22267\n",
+        "beam_df.loc[:, numerical_cols] = (beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean()) / beam_df.loc[:, numerical_cols].std()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
+        "Normalizing the data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "be40308f-d27c-46fa-e365-43a635addf6b"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>306</th>\n",
+              "      <td>-1.570727</td>\n",
+              "      <td>-0.062543</td>\n",
+              "      <td>-0.278518</td>\n",
+              "      <td>0.373194</td>\n",
+              "      <td>0.357201</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>310</th>\n",
+              "      <td>-1.631718</td>\n",
+              "      <td>-1.724526</td>\n",
+              "      <td>-0.736389</td>\n",
+              "      <td>1.087833</td>\n",
+              "      <td>0.344233</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>546</th>\n",
+              "      <td>-1.753698</td>\n",
+              "      <td>1.028793</td>\n",
+              "      <td>1.415303</td>\n",
+              "      <td>-0.339489</td>\n",
+              "      <td>0.139080</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>635</th>\n",
+              "      <td>-1.875678</td>\n",
+              "      <td>0.244869</td>\n",
+              "      <td>0.005905</td>\n",
+              "      <td>0.214107</td>\n",
+              "      <td>0.367559</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>701</th>\n",
+              "      <td>-3.278451</td>\n",
+              "      <td>-1.570523</td>\n",
+              "      <td>2.006145</td>\n",
+              "      <td>1.542754</td>\n",
+              "      <td>0.829337</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9697</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-1.151809</td>\n",
+              "      <td>-0.082944</td>\n",
+              "      <td>-0.129556</td>\n",
+              "      <td>-0.533538</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9813</th>\n",
+              "      <td>1.722740</td>\n",
+              "      <td>0.844551</td>\n",
+              "      <td>-0.583247</td>\n",
+              "      <td>-1.006447</td>\n",
+              "      <td>-0.677961</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9868</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-0.207399</td>\n",
+              "      <td>-0.784665</td>\n",
+              "      <td>-0.462136</td>\n",
+              "      <td>-0.539794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9903</th>\n",
+              "      <td>0.868878</td>\n",
+              "      <td>0.460086</td>\n",
+              "      <td>0.092258</td>\n",
+              "      <td>-0.107597</td>\n",
+              "      <td>0.071794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9956</th>\n",
+              "      <td>0.746898</td>\n",
+              "      <td>-0.234132</td>\n",
+              "      <td>-0.161116</td>\n",
+              "      <td>-0.601379</td>\n",
+              "      <td>-0.664887</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 5 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_numerical_cols = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize the DataFrame containing only the numerical columns\n",
+        "beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean())/beam_numerical_cols.std()\n",
+        "\n",
+        "ib.collect(beam_numerical_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to convert the categorical columns into one-hot encoded variables so that we can use them during training.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "Ngoxg0rSywVd",
+        "outputId": "d81bb29a-f8f8-4186-b186-5ff85667dbec"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1671644751.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'object_class'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'get_dummies'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col= beam_df.filter(items=['object_class'])\n",
+        "object_class_col.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "hz8s7z8caTq-",
+        "outputId": "5c543a6b-0ea1-41f8-afec-9691bbbd1f5b"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1927971370.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;31m# df['categories_concat'].str.get_dummies('-')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'str'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col.str.get_dummies()"

Review Comment:
   Ah but both of those examples apply the categorical type with astype on a pandas Series, outside of the test framework. It seems we don't verify it with DeferredSeries.astype
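
   For reference, a minimal pandas-only sketch of the pattern being discussed: cast a Series to a categorical dtype with `astype`, then one-hot encode it. The sample values and the category list are hypothetical (only the `object_class` name comes from the notebook), and this runs on a plain pandas Series, so it says nothing about whether `DeferredSeries.astype` behaves the same way.

```python
import pandas as pd

# Hypothetical sample values; only the column name comes from the notebook.
object_class = pd.Series(["MBA", "AMO", "MBA", "APO"], name="object_class")

# Assumed category list. Declaring it up front means the output columns are
# known before any data is inspected.
categories = ["AMO", "APO", "MBA"]
as_categorical = object_class.astype(pd.CategoricalDtype(categories=categories))

# One indicator column per declared category, in the declared order.
one_hot = pd.get_dummies(as_categorical)
print(one_hot)
```

   Declaring the categories explicitly is a design choice in this sketch: the set of output columns then does not depend on which values happen to appear in the data.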





[GitHub] [beam] TheNeuralBit commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r972161089


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. They provide a familiar interface for machine learning practitioners to build complex data-processing pipelines by invoking only standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to quickly test our pipeline locally before running it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we will only be working with a subset of the original dataset since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam method returns a deferred Beam DataFrame while pandas returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, sample] where\n",
+        "# sample contains all of the dataset (around 1000000 samples)\n",
+        "file_location = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize the pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(file_location, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. A distance of 70,000 km might not seem potentially harmful, but on an astronomical scale it is very small and can disrupt many natural phenomena. \n",
+        "\n",
+        "These objects/asteroids can thus prove to be harmful, so it is wise to know what surrounds us and which of these objects can harm us. This dataset therefore compiles the list of NASA-certified asteroids that are classified as nearest-Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types"
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "c24ff83d-3a13-47a6-c9c2-3978729fde82"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 4
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 746
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "14fa80de-2dee-4963-99d8-3e321f949ff8"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a986c6cc61ed5a5b622e163d92f73775\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "       spk_id                   full_name near_earth_object  \\\n",
+              "0     2000001                     1 Ceres                 N   \n",
+              "1     2000002                    2 Pallas                 N   \n",
+              "2     2000003                      3 Juno                 N   \n",
+              "3     2000004                     4 Vesta                 N   \n",
+              "4     2000005                   5 Astraea                 N   \n",
+              "...       ...                         ...               ...   \n",
+              "9994  2009995    9995 Alouette (4805 P-L)                 N   \n",
+              "9995  2009996         9996 ANS (9070 P-L)                 N   \n",
+              "9996  2009997        9997 COBE (1217 T-1)                 N   \n",
+              "9997  2009998         9998 ISO (1293 T-1)                 N   \n",
+              "9998  2009999       9999 Wiles (4196 T-2)                 N   \n",
+              "\n",
+              "      absolute_magnitude  diameter  albedo  diameter_sigma  eccentricity  \\\n",
+              "0                   3.40   939.400  0.0900           0.200      0.076009   \n",
+              "1                   4.20   545.000  0.1010          18.000      0.229972   \n",
+              "2                   5.33   246.596  0.2140          10.594      0.256936   \n",
+              "3                   3.00   525.400  0.4228           0.200      0.088721   \n",
+              "4                   6.90   106.699  0.2740           3.140      0.190913   \n",
+              "...                  ...       ...     ...             ...           ...   \n",
+              "9994               15.10     2.564  0.2450           0.550      0.160610   \n",
+              "9995               13.60     8.978  0.1130           0.376      0.235174   \n",
+              "9996               14.30       NaN     NaN             NaN      0.113059   \n",
+              "9997               15.10     2.235  0.3880           0.373      0.093852   \n",
+              "9998               13.00     7.148  0.2620           0.065      0.071351   \n",
+              "\n",
+              "      inclination     moid_ld object_class  semi_major_axis_au_unit  \\\n",
+              "0       10.594067  620.640533          MBA                 2.769165   \n",
+              "1       34.832932  480.348639          MBA                 2.773841   \n",
+              "2       12.991043  402.514639          MBA                 2.668285   \n",
+              "3        7.141771  443.451432          MBA                 2.361418   \n",
+              "4        5.367427  426.433027          MBA                 2.574037   \n",
+              "...           ...         ...          ...                      ...   \n",
+              "9994     2.311731  388.723233          MBA                 2.390249   \n",
+              "9995     7.657713  444.194746          MBA                 2.796605   \n",
+              "9996     2.459643  495.460110          MBA                 2.545674   \n",
+              "9997     3.912263  373.848377          MBA                 2.160961   \n",
+              "9998     3.198839  632.144398          MBA                 2.839917   \n",
+              "\n",
+              "     hazardous_flag  \n",
+              "0                 N  \n",
+              "1                 N  \n",
+              "2                 N  \n",
+              "3                 N  \n",
+              "4                 N  \n",
+              "...             ...  \n",
+              "9994              N  \n",
+              "9995              N  \n",
+              "9996              N  \n",
+              "9997              N  \n",
+              "9998              N  \n",
+              "\n",
+              "[9999 rows x 13 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-0161aa69-d50f-4d6f-84c1-10dacb278880\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>full_name</th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>2000001</td>\n",
+              "      <td>1 Ceres</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>939.400</td>\n",
+              "      <td>0.0900</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>2000002</td>\n",
+              "      <td>2 Pallas</td>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>545.000</td>\n",
+              "      <td>0.1010</td>\n",
+              "      <td>18.000</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>2000003</td>\n",
+              "      <td>3 Juno</td>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>246.596</td>\n",
+              "      <td>0.2140</td>\n",
+              "      <td>10.594</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>2000004</td>\n",
+              "      <td>4 Vesta</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>525.400</td>\n",
+              "      <td>0.4228</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>2000005</td>\n",
+              "      <td>5 Astraea</td>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>106.699</td>\n",
+              "      <td>0.2740</td>\n",
+              "      <td>3.140</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>2009995</td>\n",
+              "      <td>9995 Alouette (4805 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.564</td>\n",
+              "      <td>0.2450</td>\n",
+              "      <td>0.550</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>2009996</td>\n",
+              "      <td>9996 ANS (9070 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>8.978</td>\n",
+              "      <td>0.1130</td>\n",
+              "      <td>0.376</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>2009997</td>\n",
+              "      <td>9997 COBE (1217 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>2009998</td>\n",
+              "      <td>9998 ISO (1293 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.235</td>\n",
+              "      <td>0.3880</td>\n",
+              "      <td>0.373</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>2009999</td>\n",
+              "      <td>9999 Wiles (4196 T-2)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>7.148</td>\n",
+              "      <td>0.2620</td>\n",
+              "      <td>0.065</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 13 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-0161aa69-d50f-4d6f-84c1-10dacb278880')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 5
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check whether all the columns can be used for model training. Let's first have a look at the column descriptions as provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the magnitude an asteroid would have if it were located 1 au from both the Sun and the observer, at zero phase angle.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km.\n",
+        "* **albedo:** a measure of the diffuse reflection of solar radiation out of the total solar radiation, measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in object diameter, in km.\n",
+        "* **eccentricity:** value between 0 and 1 that refers to how much the asteroid's orbit deviates from a circle (0 is a perfect circle).\n",
+        "* **inclination:** angle with respect to x-y ecliptic plane\n",
+        "* **moid_ld:** Earth Minimum Orbit Intersection Distance, in lunar distances (LD).\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** half the length of the long axis of the orbit, in au.\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. These columns can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's have a look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_02cba1067de8e024374a26297834d233\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most columns have no missing values. Columns **'diameter'**, **'albedo'** and **'diameter_sigma'** have many missing values (roughly 13-14% each). Since these values cannot be measured or derived, we can remove these columns; they will not be used for machine learning model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "The numerical columns need to be normalized before using them to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation. This ensures that all the columns have the same scale and are weighted equally during training."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical columns and the categorical columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 356
+        },
+        "id": "Gjc0UlDD-xUn",
+        "outputId": "cadc4402-7edc-43f6-bc6a-bc4f1fd3314a"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "error",
+          "ename": "NotImplementedError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mNotImplementedError\u001b[0m                       Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/987840581.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Normalizing method_1: Can work but relies on ticket\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001
 b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frame_base.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    426\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    427\u001b[0m     raise NotImplementedError(\n\u001b[0;32m--> 428\u001b[0;31m         \u001b[0;34mf\"{op!r} is not implemented yet. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    429\u001b[0m         \u001b[0;34mf\"If support for {op!r} is important to you, please let the Beam \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    430\u001b[0m         \u001b[0;34m\"community know by writing to user@beam.apache.org \"\u001b[0m\u001b[0;34m\u
 001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mNotImplementedError\u001b[0m: 'loc.setitem' is not implemented yet. If support for 'loc.setitem' is important to you, please let the Beam community know by writing to user@beam.apache.org (see https://beam.apache.org/community/contact-us/) or commenting on https://github.com/apache/beam/issues/20318"
+          ]
+        }
+      ],
+      "source": [
+        "# Normalizing method_1: Can work but relies on ticket #22267\n",
+        "beam_df.loc[:,numerical_cols] = (beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean())/beam_df.loc[:, numerical_cols].std()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
+        "Normalizing the data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "be40308f-d27c-46fa-e365-43a635addf6b"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>306</th>\n",
+              "      <td>-1.570727</td>\n",
+              "      <td>-0.062543</td>\n",
+              "      <td>-0.278518</td>\n",
+              "      <td>0.373194</td>\n",
+              "      <td>0.357201</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>310</th>\n",
+              "      <td>-1.631718</td>\n",
+              "      <td>-1.724526</td>\n",
+              "      <td>-0.736389</td>\n",
+              "      <td>1.087833</td>\n",
+              "      <td>0.344233</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>546</th>\n",
+              "      <td>-1.753698</td>\n",
+              "      <td>1.028793</td>\n",
+              "      <td>1.415303</td>\n",
+              "      <td>-0.339489</td>\n",
+              "      <td>0.139080</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>635</th>\n",
+              "      <td>-1.875678</td>\n",
+              "      <td>0.244869</td>\n",
+              "      <td>0.005905</td>\n",
+              "      <td>0.214107</td>\n",
+              "      <td>0.367559</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>701</th>\n",
+              "      <td>-3.278451</td>\n",
+              "      <td>-1.570523</td>\n",
+              "      <td>2.006145</td>\n",
+              "      <td>1.542754</td>\n",
+              "      <td>0.829337</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9697</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-1.151809</td>\n",
+              "      <td>-0.082944</td>\n",
+              "      <td>-0.129556</td>\n",
+              "      <td>-0.533538</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9813</th>\n",
+              "      <td>1.722740</td>\n",
+              "      <td>0.844551</td>\n",
+              "      <td>-0.583247</td>\n",
+              "      <td>-1.006447</td>\n",
+              "      <td>-0.677961</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9868</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-0.207399</td>\n",
+              "      <td>-0.784665</td>\n",
+              "      <td>-0.462136</td>\n",
+              "      <td>-0.539794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9903</th>\n",
+              "      <td>0.868878</td>\n",
+              "      <td>0.460086</td>\n",
+              "      <td>0.092258</td>\n",
+              "      <td>-0.107597</td>\n",
+              "      <td>0.071794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9956</th>\n",
+              "      <td>0.746898</td>\n",
+              "      <td>-0.234132</td>\n",
+              "      <td>-0.161116</td>\n",
+              "      <td>-0.601379</td>\n",
+              "      <td>-0.664887</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 5 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_numerical_cols = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize the DataFrame that contains only the numerical columns\n",
+        "beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean())/beam_numerical_cols.std()\n",
+        "\n",
+        "ib.collect(beam_numerical_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to convert the categorical columns into one-hot encoded variables to use them during training. \n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "Ngoxg0rSywVd",
+        "outputId": "d81bb29a-f8f8-4186-b186-5ff85667dbec"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1671644751.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'object_class'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'get_dummies'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col= beam_df.filter(items=['object_class'])\n",
+        "object_class_col.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "hz8s7z8caTq-",
+        "outputId": "5c543a6b-0ea1-41f8-afec-9691bbbd1f5b"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1927971370.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;31m# df['categories_concat'].str.get_dummies('-')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'str'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col.str.get_dummies()"

Review Comment:
   Interesting: apparently instances of `CategoricalDtype` are considered equal to the string "category": https://github.com/pandas-dev/pandas/blob/54347fe684e0f7844bf407b1fb958a5269646825/pandas/core/dtypes/dtypes.py#L366
   
   The aim in our check was to avoid the case where users indicate `astype("category")` and rely on pandas to resolve the categories, since we need explicit categories. We should be able to find another way to check this, but it will have to be another bugfix.
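
A minimal sketch of the behavior described above, assuming a recent pandas release: a `CategoricalDtype` compares equal to the string "category" whether or not it carries explicit categories, so the string comparison alone cannot separate the two cases. The helper `has_explicit_categories` is hypothetical and only illustrates one possible alternative check; it is not the check used in Beam.

```python
import pandas as pd

# A dtype with user-supplied categories, and one that leaves them unresolved.
explicit = pd.CategoricalDtype(categories=["a", "b"])
implicit = pd.CategoricalDtype()

# Both compare equal to the string "category", so `dtype == "category"`
# cannot distinguish the explicit case from the implicit one.
string_eq = (explicit == "category", implicit == "category")


def has_explicit_categories(dtype):
    """Hypothetical helper: True only for a CategoricalDtype with known categories."""
    return isinstance(dtype, pd.CategoricalDtype) and dtype.categories is not None


checks = (
    has_explicit_categories(explicit),
    has_explicit_categories(implicit),
    has_explicit_categories("category"),
)
```

Checking the type and the `categories` attribute, rather than comparing against the string, is one way to reject `astype("category")` while still accepting a dtype with explicit categories.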





[GitHub] [beam] TheNeuralBit commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r941893216


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. This API offers a familiar interface for machine learning practitioners to build complex data-processing pipelines by only invoking standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to quickly test our pipeline locally before running it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we will only be working with a subset of the original dataset since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam function returns a deferred Beam DataFrame, while the pandas function returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, sample] where\n",
+        "# sample contains all of the dataset (around 1000000 samples)\n",
+        "file_location = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(file_location, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "Countless objects exist in outer space, and some of them are closer than we think. A distance of 70,000 km may seem too large to pose a threat, but at an astronomical scale it is very small and can disrupt many natural phenomena. \n",
+        "\n",
+        "These objects, such as asteroids, can therefore prove harmful, so it is wise to know what surrounds us and which of these objects could harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types."
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "c24ff83d-3a13-47a6-c9c2-3978729fde82"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 4
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a Pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 746
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "14fa80de-2dee-4963-99d8-3e321f949ff8"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a986c6cc61ed5a5b622e163d92f73775\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "       spk_id                   full_name near_earth_object  \\\n",
+              "0     2000001                     1 Ceres                 N   \n",
+              "1     2000002                    2 Pallas                 N   \n",
+              "2     2000003                      3 Juno                 N   \n",
+              "3     2000004                     4 Vesta                 N   \n",
+              "4     2000005                   5 Astraea                 N   \n",
+              "...       ...                         ...               ...   \n",
+              "9994  2009995    9995 Alouette (4805 P-L)                 N   \n",
+              "9995  2009996         9996 ANS (9070 P-L)                 N   \n",
+              "9996  2009997        9997 COBE (1217 T-1)                 N   \n",
+              "9997  2009998         9998 ISO (1293 T-1)                 N   \n",
+              "9998  2009999       9999 Wiles (4196 T-2)                 N   \n",
+              "\n",
+              "      absolute_magnitude  diameter  albedo  diameter_sigma  eccentricity  \\\n",
+              "0                   3.40   939.400  0.0900           0.200      0.076009   \n",
+              "1                   4.20   545.000  0.1010          18.000      0.229972   \n",
+              "2                   5.33   246.596  0.2140          10.594      0.256936   \n",
+              "3                   3.00   525.400  0.4228           0.200      0.088721   \n",
+              "4                   6.90   106.699  0.2740           3.140      0.190913   \n",
+              "...                  ...       ...     ...             ...           ...   \n",
+              "9994               15.10     2.564  0.2450           0.550      0.160610   \n",
+              "9995               13.60     8.978  0.1130           0.376      0.235174   \n",
+              "9996               14.30       NaN     NaN             NaN      0.113059   \n",
+              "9997               15.10     2.235  0.3880           0.373      0.093852   \n",
+              "9998               13.00     7.148  0.2620           0.065      0.071351   \n",
+              "\n",
+              "      inclination     moid_ld object_class  semi_major_axis_au_unit  \\\n",
+              "0       10.594067  620.640533          MBA                 2.769165   \n",
+              "1       34.832932  480.348639          MBA                 2.773841   \n",
+              "2       12.991043  402.514639          MBA                 2.668285   \n",
+              "3        7.141771  443.451432          MBA                 2.361418   \n",
+              "4        5.367427  426.433027          MBA                 2.574037   \n",
+              "...           ...         ...          ...                      ...   \n",
+              "9994     2.311731  388.723233          MBA                 2.390249   \n",
+              "9995     7.657713  444.194746          MBA                 2.796605   \n",
+              "9996     2.459643  495.460110          MBA                 2.545674   \n",
+              "9997     3.912263  373.848377          MBA                 2.160961   \n",
+              "9998     3.198839  632.144398          MBA                 2.839917   \n",
+              "\n",
+              "     hazardous_flag  \n",
+              "0                 N  \n",
+              "1                 N  \n",
+              "2                 N  \n",
+              "3                 N  \n",
+              "4                 N  \n",
+              "...             ...  \n",
+              "9994              N  \n",
+              "9995              N  \n",
+              "9996              N  \n",
+              "9997              N  \n",
+              "9998              N  \n",
+              "\n",
+              "[9999 rows x 13 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-0161aa69-d50f-4d6f-84c1-10dacb278880\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>full_name</th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>2000001</td>\n",
+              "      <td>1 Ceres</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>939.400</td>\n",
+              "      <td>0.0900</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>2000002</td>\n",
+              "      <td>2 Pallas</td>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>545.000</td>\n",
+              "      <td>0.1010</td>\n",
+              "      <td>18.000</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>2000003</td>\n",
+              "      <td>3 Juno</td>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>246.596</td>\n",
+              "      <td>0.2140</td>\n",
+              "      <td>10.594</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>2000004</td>\n",
+              "      <td>4 Vesta</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>525.400</td>\n",
+              "      <td>0.4228</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>2000005</td>\n",
+              "      <td>5 Astraea</td>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>106.699</td>\n",
+              "      <td>0.2740</td>\n",
+              "      <td>3.140</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>2009995</td>\n",
+              "      <td>9995 Alouette (4805 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.564</td>\n",
+              "      <td>0.2450</td>\n",
+              "      <td>0.550</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>2009996</td>\n",
+              "      <td>9996 ANS (9070 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>8.978</td>\n",
+              "      <td>0.1130</td>\n",
+              "      <td>0.376</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>2009997</td>\n",
+              "      <td>9997 COBE (1217 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>2009998</td>\n",
+              "      <td>9998 ISO (1293 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.235</td>\n",
+              "      <td>0.3880</td>\n",
+              "      <td>0.373</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>2009999</td>\n",
+              "      <td>9999 Wiles (4196 T-2)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>7.148</td>\n",
+              "      <td>0.2620</td>\n",
+              "      <td>0.065</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 13 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-0161aa69-d50f-4d6f-84c1-10dacb278880')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 5
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check if all the columns can be used for model training. Let's first have a look at the column description as provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the apparent magnitude an object would have if it were located 1 au from both the Sun and the observer, at zero phase angle.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km.\n",
+        "* **albedo:** a measure of the diffuse reflection of solar radiation out of the total solar radiation and measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in object diameter, in km.\n",
+        "* **eccentricity:** value between 0 and 1 that describes how elongated or circular the asteroid's orbit is.\n",
+        "* **inclination:** angle of the orbit with respect to the x-y ecliptic plane.\n",
+        "* **moid_ld:** Earth Minimum Orbit Intersection Distance, in lunar distance (LD) units.\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** half the length of the long axis of the orbit, in au.\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. These columns can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's have a look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_02cba1067de8e024374a26297834d233\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most columns have no missing values. The columns **'diameter'**, **'albedo'** and **'diameter_sigma'** are each missing in roughly 13-14% of rows. Because these values cannot be measured or derived and are not required for training the machine learning model, we can drop these columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "The numerical columns need to be normalized before using them to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation. This puts all numerical features on the same scale, so that no feature dominates training simply because of its magnitude."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical columns and the categorical columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 356
+        },
+        "id": "Gjc0UlDD-xUn",
+        "outputId": "cadc4402-7edc-43f6-bc6a-bc4f1fd3314a"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "error",
+          "ename": "NotImplementedError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mNotImplementedError\u001b[0m                       Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/987840581.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Normalizing method_1: Can work but relies on ticket\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frame_base.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    426\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    427\u001b[0m     raise NotImplementedError(\n\u001b[0;32m--> 428\u001b[0;31m         \u001b[0;34mf\"{op!r} is not implemented yet. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    429\u001b[0m         \u001b[0;34mf\"If support for {op!r} is important to you, please let the Beam \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    430\u001b[0m         \u001b[0;34m\"community know by writing to user@beam.apache.org \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mNotImplementedError\u001b[0m: 'loc.setitem' is not implemented yet. If support for 'loc.setitem' is important to you, please let the Beam community know by writing to user@beam.apache.org (see https://beam.apache.org/community/contact-us/) or commenting on https://github.com/apache/beam/issues/20318"
+          ]
+        }
+      ],
+      "source": [
+        "# Normalizing method_1: Can work but relies on ticket #22267\n",
+        "beam_df.loc[:, numerical_cols] = (beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean()) / beam_df.loc[:, numerical_cols].std()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
+        "Normalizing the data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "be40308f-d27c-46fa-e365-43a635addf6b"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>306</th>\n",
+              "      <td>-1.570727</td>\n",
+              "      <td>-0.062543</td>\n",
+              "      <td>-0.278518</td>\n",
+              "      <td>0.373194</td>\n",
+              "      <td>0.357201</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>310</th>\n",
+              "      <td>-1.631718</td>\n",
+              "      <td>-1.724526</td>\n",
+              "      <td>-0.736389</td>\n",
+              "      <td>1.087833</td>\n",
+              "      <td>0.344233</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>546</th>\n",
+              "      <td>-1.753698</td>\n",
+              "      <td>1.028793</td>\n",
+              "      <td>1.415303</td>\n",
+              "      <td>-0.339489</td>\n",
+              "      <td>0.139080</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>635</th>\n",
+              "      <td>-1.875678</td>\n",
+              "      <td>0.244869</td>\n",
+              "      <td>0.005905</td>\n",
+              "      <td>0.214107</td>\n",
+              "      <td>0.367559</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>701</th>\n",
+              "      <td>-3.278451</td>\n",
+              "      <td>-1.570523</td>\n",
+              "      <td>2.006145</td>\n",
+              "      <td>1.542754</td>\n",
+              "      <td>0.829337</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9697</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-1.151809</td>\n",
+              "      <td>-0.082944</td>\n",
+              "      <td>-0.129556</td>\n",
+              "      <td>-0.533538</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9813</th>\n",
+              "      <td>1.722740</td>\n",
+              "      <td>0.844551</td>\n",
+              "      <td>-0.583247</td>\n",
+              "      <td>-1.006447</td>\n",
+              "      <td>-0.677961</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9868</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-0.207399</td>\n",
+              "      <td>-0.784665</td>\n",
+              "      <td>-0.462136</td>\n",
+              "      <td>-0.539794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9903</th>\n",
+              "      <td>0.868878</td>\n",
+              "      <td>0.460086</td>\n",
+              "      <td>0.092258</td>\n",
+              "      <td>-0.107597</td>\n",
+              "      <td>0.071794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9956</th>\n",
+              "      <td>0.746898</td>\n",
+              "      <td>-0.234132</td>\n",
+              "      <td>-0.161116</td>\n",
+              "      <td>-0.601379</td>\n",
+              "      <td>-0.664887</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 5 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_numerical_cols = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize dataframes only with numerical columns\n",
+        "beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean())/beam_numerical_cols.std()\n",
+        "\n",
+        "ib.collect(beam_numerical_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to convert the categorical columns into one-hot encoded variables to use them during training.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "Ngoxg0rSywVd",
+        "outputId": "d81bb29a-f8f8-4186-b186-5ff85667dbec"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1671644751.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'object_class'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'get_dummies'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col= beam_df.filter(items=['object_class'])\n",
+        "object_class_col.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "hz8s7z8caTq-",
+        "outputId": "5c543a6b-0ea1-41f8-afec-9691bbbd1f5b"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1927971370.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;31m# df['categories_concat'].str.get_dummies('-')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'str'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col.str.get_dummies()"

Review Comment:
   `str` is an attribute on Series, so you need to select a Series (e.g. `object_class_col = beam_df.object_class`) in order to use `str.get_dummies`. Selecting a single column with `filter` produces a single-column DataFrame, not a Series.
   
   You will also need to configure that column to have a `CategoricalDtype`. This is where you can use `unique` and `ib.collect`, something like:
   ```
   unique_classes = ib.collect(object_class_col.unique())
   object_class_col.astype(pd.CategoricalDtype(unique_classes)).str.get_dummies()
   ```
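
The DataFrame-versus-Series distinction can be checked in plain pandas. This is only a minimal sketch: the toy values are made up for illustration, and it uses eager pandas rather than Beam's deferred API, which may behave differently (for example, it is why the categories have to be known up front in the snippet above).

```python
import pandas as pd

# Toy data standing in for the notebook's dataset (values are illustrative).
df = pd.DataFrame({"object_class": ["MBA", "APO", "MBA"]})

# filter(items=[...]) returns a DataFrame, which has no `.str` accessor.
single_col_df = df.filter(items=["object_class"])
has_str_accessor = hasattr(single_col_df, "str")

# Attribute (or bracket) access returns a Series, which does expose `.str`.
series = df.object_class
dummies = series.str.get_dummies()
```

Here `has_str_accessor` is `False`, while `dummies` has one indicator column per distinct class.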



##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. They provide a familiar interface for machine learning practitioners to build complex data-processing pipelines by only invoking standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to quickly test our pipeline locally before running it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we will only be working with a subset of the original dataset since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam function returns a deferred Beam DataFrame while the pandas function returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, sample] where\n",
+        "# sample contains all of the dataset (around 1000000 samples)\n",
+        "file_location = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(file_location, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. Although a distance of 70,000 km might not seem close enough to harm us, on an astronomical scale it is very small and can disrupt many natural phenomena.\n",
+        "\n",
+        "These objects/asteroids can therefore prove harmful, so it is wise to know what surrounds us and which of these objects could harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types."
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "c24ff83d-3a13-47a6-c9c2-3978729fde82"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 4
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 746
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "14fa80de-2dee-4963-99d8-3e321f949ff8"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a986c6cc61ed5a5b622e163d92f73775\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "       spk_id                   full_name near_earth_object  \\\n",
+              "0     2000001                     1 Ceres                 N   \n",
+              "1     2000002                    2 Pallas                 N   \n",
+              "2     2000003                      3 Juno                 N   \n",
+              "3     2000004                     4 Vesta                 N   \n",
+              "4     2000005                   5 Astraea                 N   \n",
+              "...       ...                         ...               ...   \n",
+              "9994  2009995    9995 Alouette (4805 P-L)                 N   \n",
+              "9995  2009996         9996 ANS (9070 P-L)                 N   \n",
+              "9996  2009997        9997 COBE (1217 T-1)                 N   \n",
+              "9997  2009998         9998 ISO (1293 T-1)                 N   \n",
+              "9998  2009999       9999 Wiles (4196 T-2)                 N   \n",
+              "\n",
+              "      absolute_magnitude  diameter  albedo  diameter_sigma  eccentricity  \\\n",
+              "0                   3.40   939.400  0.0900           0.200      0.076009   \n",
+              "1                   4.20   545.000  0.1010          18.000      0.229972   \n",
+              "2                   5.33   246.596  0.2140          10.594      0.256936   \n",
+              "3                   3.00   525.400  0.4228           0.200      0.088721   \n",
+              "4                   6.90   106.699  0.2740           3.140      0.190913   \n",
+              "...                  ...       ...     ...             ...           ...   \n",
+              "9994               15.10     2.564  0.2450           0.550      0.160610   \n",
+              "9995               13.60     8.978  0.1130           0.376      0.235174   \n",
+              "9996               14.30       NaN     NaN             NaN      0.113059   \n",
+              "9997               15.10     2.235  0.3880           0.373      0.093852   \n",
+              "9998               13.00     7.148  0.2620           0.065      0.071351   \n",
+              "\n",
+              "      inclination     moid_ld object_class  semi_major_axis_au_unit  \\\n",
+              "0       10.594067  620.640533          MBA                 2.769165   \n",
+              "1       34.832932  480.348639          MBA                 2.773841   \n",
+              "2       12.991043  402.514639          MBA                 2.668285   \n",
+              "3        7.141771  443.451432          MBA                 2.361418   \n",
+              "4        5.367427  426.433027          MBA                 2.574037   \n",
+              "...           ...         ...          ...                      ...   \n",
+              "9994     2.311731  388.723233          MBA                 2.390249   \n",
+              "9995     7.657713  444.194746          MBA                 2.796605   \n",
+              "9996     2.459643  495.460110          MBA                 2.545674   \n",
+              "9997     3.912263  373.848377          MBA                 2.160961   \n",
+              "9998     3.198839  632.144398          MBA                 2.839917   \n",
+              "\n",
+              "     hazardous_flag  \n",
+              "0                 N  \n",
+              "1                 N  \n",
+              "2                 N  \n",
+              "3                 N  \n",
+              "4                 N  \n",
+              "...             ...  \n",
+              "9994              N  \n",
+              "9995              N  \n",
+              "9996              N  \n",
+              "9997              N  \n",
+              "9998              N  \n",
+              "\n",
+              "[9999 rows x 13 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-0161aa69-d50f-4d6f-84c1-10dacb278880\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>full_name</th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>2000001</td>\n",
+              "      <td>1 Ceres</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>939.400</td>\n",
+              "      <td>0.0900</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>2000002</td>\n",
+              "      <td>2 Pallas</td>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>545.000</td>\n",
+              "      <td>0.1010</td>\n",
+              "      <td>18.000</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>2000003</td>\n",
+              "      <td>3 Juno</td>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>246.596</td>\n",
+              "      <td>0.2140</td>\n",
+              "      <td>10.594</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>2000004</td>\n",
+              "      <td>4 Vesta</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>525.400</td>\n",
+              "      <td>0.4228</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>2000005</td>\n",
+              "      <td>5 Astraea</td>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>106.699</td>\n",
+              "      <td>0.2740</td>\n",
+              "      <td>3.140</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>2009995</td>\n",
+              "      <td>9995 Alouette (4805 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.564</td>\n",
+              "      <td>0.2450</td>\n",
+              "      <td>0.550</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>2009996</td>\n",
+              "      <td>9996 ANS (9070 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>8.978</td>\n",
+              "      <td>0.1130</td>\n",
+              "      <td>0.376</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>2009997</td>\n",
+              "      <td>9997 COBE (1217 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>2009998</td>\n",
+              "      <td>9998 ISO (1293 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.235</td>\n",
+              "      <td>0.3880</td>\n",
+              "      <td>0.373</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>2009999</td>\n",
+              "      <td>9999 Wiles (4196 T-2)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>7.148</td>\n",
+              "      <td>0.2620</td>\n",
+              "      <td>0.065</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 13 columns</p>\n",
+              "</div>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 5
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check whether all the columns can be used for model training. Let's first look at the column descriptions provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-Earth object flag\n",
+        "* **absolute_magnitude:** the apparent magnitude an object would have if it were located at a distance of 10 parsecs\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km\n",
+        "* **albedo:** a measure of the diffuse reflection of solar radiation out of the total solar radiation, measured on a scale from 0 to 1\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in object diameter, in km\n",
+        "* **eccentricity:** value between 0 and 1 that refers to how flat or round the asteroid's orbit is\n",
+        "* **inclination:** angle of the orbit with respect to the x-y ecliptic plane\n",
+        "* **moid_ld:** Earth minimum orbit intersection distance, in lunar distances (LD)\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** the length of half of the long axis of the orbit, in au\n",
+        "* **hazardous_flag:** Hazardous asteroid flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. These columns can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most columns have no missing values. However, columns **'diameter'**, **'albedo'** and **'diameter_sigma'** each have roughly 13-14% missing values. Since these values cannot be measured or derived, we can remove these columns; they will not be used for machine learning model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "The numerical columns need to be normalized before they are used to train a model. A common standardization method is to subtract the mean and divide by the standard deviation. This puts all columns on the same scale so that no feature dominates training merely because of its magnitude."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical columns and the categorical columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 356
+        },
+        "id": "Gjc0UlDD-xUn",
+        "outputId": "cadc4402-7edc-43f6-bc6a-bc4f1fd3314a"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "error",
+          "ename": "NotImplementedError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mNotImplementedError\u001b[0m                       Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/987840581.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Normalizing method_1: Can work but relies on ticket\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001
 b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frame_base.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    426\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    427\u001b[0m     raise NotImplementedError(\n\u001b[0;32m--> 428\u001b[0;31m         \u001b[0;34mf\"{op!r} is not implemented yet. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    429\u001b[0m         \u001b[0;34mf\"If support for {op!r} is important to you, please let the Beam \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    430\u001b[0m         \u001b[0;34m\"community know by writing to user@beam.apache.org \"\u001b[0m\u001b[0;34m\u
 001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mNotImplementedError\u001b[0m: 'loc.setitem' is not implemented yet. If support for 'loc.setitem' is important to you, please let the Beam community know by writing to user@beam.apache.org (see https://beam.apache.org/community/contact-us/) or commenting on https://github.com/apache/beam/issues/20318"
+          ]
+        }
+      ],
+      "source": [
+        "# Normalizing method_1: Can work but relies on ticket #22267\n",
+        "beam_df.loc[:, numerical_cols] = (beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean())/beam_df.loc[:, numerical_cols].std()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
+        "Normalizing the data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "be40308f-d27c-46fa-e365-43a635addf6b"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>306</th>\n",
+              "      <td>-1.570727</td>\n",
+              "      <td>-0.062543</td>\n",
+              "      <td>-0.278518</td>\n",
+              "      <td>0.373194</td>\n",
+              "      <td>0.357201</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>310</th>\n",
+              "      <td>-1.631718</td>\n",
+              "      <td>-1.724526</td>\n",
+              "      <td>-0.736389</td>\n",
+              "      <td>1.087833</td>\n",
+              "      <td>0.344233</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>546</th>\n",
+              "      <td>-1.753698</td>\n",
+              "      <td>1.028793</td>\n",
+              "      <td>1.415303</td>\n",
+              "      <td>-0.339489</td>\n",
+              "      <td>0.139080</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>635</th>\n",
+              "      <td>-1.875678</td>\n",
+              "      <td>0.244869</td>\n",
+              "      <td>0.005905</td>\n",
+              "      <td>0.214107</td>\n",
+              "      <td>0.367559</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>701</th>\n",
+              "      <td>-3.278451</td>\n",
+              "      <td>-1.570523</td>\n",
+              "      <td>2.006145</td>\n",
+              "      <td>1.542754</td>\n",
+              "      <td>0.829337</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9697</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-1.151809</td>\n",
+              "      <td>-0.082944</td>\n",
+              "      <td>-0.129556</td>\n",
+              "      <td>-0.533538</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9813</th>\n",
+              "      <td>1.722740</td>\n",
+              "      <td>0.844551</td>\n",
+              "      <td>-0.583247</td>\n",
+              "      <td>-1.006447</td>\n",
+              "      <td>-0.677961</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9868</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-0.207399</td>\n",
+              "      <td>-0.784665</td>\n",
+              "      <td>-0.462136</td>\n",
+              "      <td>-0.539794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9903</th>\n",
+              "      <td>0.868878</td>\n",
+              "      <td>0.460086</td>\n",
+              "      <td>0.092258</td>\n",
+              "      <td>-0.107597</td>\n",
+              "      <td>0.071794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9956</th>\n",
+              "      <td>0.746898</td>\n",
+              "      <td>-0.234132</td>\n",
+              "      <td>-0.161116</td>\n",
+              "      <td>-0.601379</td>\n",
+              "      <td>-0.664887</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 5 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_numerical_cols = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize the dataframe that contains only the numerical columns\n",
+        "beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean())/beam_numerical_cols.std()\n",
+        "\n",
+        "ib.collect(beam_numerical_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to one-hot encode the categorical columns so that they can be used during training.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "Ngoxg0rSywVd",
+        "outputId": "d81bb29a-f8f8-4186-b186-5ff85667dbec"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1671644751.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'object_class'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'get_dummies'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col = beam_df.filter(items=['object_class'])\n",
+        "object_class_col.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "hz8s7z8caTq-",
+        "outputId": "5c543a6b-0ea1-41f8-afec-9691bbbd1f5b"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1927971370.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;31m# df['categories_concat'].str.get_dummies('-')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'str'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col.str.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "rVdSIyCB0spw"
+      },
+      "source": [
+        "# Putting it all together\n",
+        "\n",
+        "Let's now combine all the steps executed above into a full pipeline implementation and visualize our preprocessed data.\n",
+        "\n",
+        "> ℹ️ Note that the only standard Beam construct invoked here is the `pipeline` instance. The rest of the preprocessing commands are all based on native pandas methods that have been integrated with the Beam DataFrame API."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "ndaSNond0v8Q",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 498
+        },
+        "outputId": "0155d359-45c9-4345-e1b6-b1881408f049"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "error",
+          "ename": "ValueError",
+          "evalue": "ignored",
+          "traceback": [
+            "---------------------------------------------------------------------------",
+            "ValueError                                Traceback (most recent call last)",
+            "/tmp/ipykernel_325/1408061827.py in <module>\n     25 \n     26 \n---> 27 ib.collect(preprocessed_dataset)\n",
+            "/content/beam/sdks/python/apache_beam/runners/interactive/utils.py in run_within_progress_indicator(*args, **kwargs)\n    275   def run_within_progress_indicator(*args, **kwargs):\n    276     with ProgressIndicator(f'Processing... {func.__name__}', 'Done.'):\n--> 277       return func(*args, **kwargs)\n    278 \n    279   return run_within_progress_indicator\n",
+            "/content/beam/sdks/python/apache_beam/runners/interactive/interactive_beam.py in collect(pcoll, n, duration, include_window_info)\n    945         element_type=element_type)\n    946 \n--> 947   recording = recording_manager.record([pcoll], max_n=n, max_duration=duration)\n    948 \n    949   try:\n",
+            "/content/beam/sdks/python/apache_beam/runners/interactive/recording_manager.py in record(self, pcolls, max_n, max_duration)\n    459       pf.PipelineFragment(\n    460           list(uncomputed_pcolls),\n--> 461           self.user_pipeline.options).run(blocking=is_remote_run)\n    462       result = ie.current_env().pipeline_result(self.user_pipeline)\n    463     else:\n",
+            "/content/beam/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py in run(self, display_pipeline_graph, use_cache, blocking)\n    111       self._runner_pipeline.runner._force_compute = not use_cache\n    112       self._runner_pipeline.runner._blocking = blocking\n--> 113       return self.deduce_fragment().run()\n    114     finally:\n    115       self._runner_pipeline.runner._skip_display = preserved_skip_display\n",
+            "/content/beam/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py in deduce_fragment(self)\n     98         self._runner_pipeline.to_runner_api(),\n     99         self._runner_pipeline.runner,\n--> 100         self._options)\n    101     ie.current_env().add_derived_pipeline(self._runner_pipeline, fragment)\n    102     return fragment\n",
+            "/content/beam/sdks/python/apache_beam/pipeline.py in from_runner_api(proto, runner, options, return_context)\n    990       pcollection.pipeline = p\n    991       if not pcollection.producer:\n--> 992         raise ValueError('No producer for %s' % id)\n    993 \n    994     # Inject PBegin input where necessary.\n",
+            "ValueError: No producer for ref_PCollection_PCollection_265"
+          ]
+        }
+      ],
+      "source": [
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv('/content/drive/MyDrive/apache beam/dataset/nasa/sample_10000.csv', splittable=True)\n",
+        "\n",
+        "# Drop irrelevant columns/columns with missing values\n",
+        "beam_df = beam_df.drop(['spk_id', 'full_name','diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "\n",
+        "# Get numerical columns/columns with categorical variables\n",
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))\n",
+        "\n",
+        "# Normalize the numerical variables \n",
+        "beam_df_numerical = beam_df.filter(items=numerical_cols)\n",
+        "beam_df_numerical = (beam_df_numerical - beam_df_numerical.mean())/beam_df_numerical.std()\n",
+        "\n",
+        "\n",
+        "# One-hot encode the categorical variables \n",
+        "beam_df_categorical = beam_df.filter(items=categorical_cols)\n",
+        "# TODO: one-hot encoding step\n",
+        "\n",
+        "# Merge the normalized variables with the one-hot encoded variables\n",
+        "preprocessed_dataset = beam_df_categorical.merge(beam_df_numerical, left_index = True, right_index = True)\n",
+        "\n",
+        "ib.collect(preprocessed_dataset)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xZvJTqa3XKI_"
+      },
+      "source": [
+        "# Part II : Process the full dataset with the Distributed Runner\n",
+        "Now that we've showcased how to build and execute the pipeline locally using the Interactive Runner, it's time to execute our pipeline on our full dataset by switching to a distributed runner. For this example, we will execute our pipeline on [Dataflow](https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline)."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "PROJECT_ID = \"<my-gcp-project>\"\n",
+        "REGION = \"us-west1\"\n",
+        "TEMP_DIR = \"gs://<my-bucket>/tmp\"\n",
+        "OUTPUT_DIR = \"gs://<my-bucket>/dataframe-result\""
+      ],
+      "metadata": {
+        "id": "dDBYbMEWbL4t"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "> ℹ️ Note that we are now processing the full dataset `sample.csv`, which contains approximately 1 million rows. We're also writing the results to a `csv` file instead of using `ib.collect()` to materialize the deferred DataFrame.\n",

Review Comment:
   Should `sample.csv` be renamed `full.csv`? The name sample makes me think it's still a subset of the full dataset.





[GitHub] [beam] TheNeuralBit commented on pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on PR #22587:
URL: https://github.com/apache/beam/pull/22587#issuecomment-1209999468

   > release
   
   Note that if we have to do that to unblock this change, it will be blocked until 2.42.0 is out.




[GitHub] [beam] KevinGG commented on pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
KevinGG commented on PR #22587:
URL: https://github.com/apache/beam/pull/22587#issuecomment-1239789553

   > > > I think the "No producer" error is a bug with PCollection pruning in interactive beam: #21430
   > > > CC: @KevinGG
   > > 
   > > 
   > > Commented in #21430, we can disable pruning for dataframe like what we did for TestStream: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py#L219
   > 
   > Is there any update on this or a potential workaround for merging Deferred dataframes?
   
   Just sent out https://github.com/apache/beam/pull/23069, this should mitigate the unintended pruning issues.
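
For reference, the preprocessing that the failing cell attempts (standardize the numerical columns, then join them back to the categorical ones on the index) can be sketched with plain pandas. This is only an illustration on a hypothetical four-row frame whose column names mirror a few of the notebook's columns. It uses eager pandas rather than Beam's deferred DataFrames, so it says nothing about whether the pruning issue in Interactive Beam is resolved.

```python
import pandas as pd

# Hypothetical toy data; the real notebook reads a CSV from GCS instead.
df = pd.DataFrame({
    "absolute_magnitude": [3.40, 4.20, 5.33, 3.00],
    "eccentricity": [0.076009, 0.229972, 0.256936, 0.088721],
    "object_class": ["MBA", "MBA", "MBA", "MBA"],
    "hazardous_flag": ["N", "N", "N", "N"],
})

# Split the columns into numerical and categorical groups.
numerical_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = [c for c in df.columns if c not in numerical_cols]

# Standardize each numerical column: subtract the mean, divide by the
# sample standard deviation (pandas' default, ddof=1).
df_numerical = df[numerical_cols]
df_numerical = (df_numerical - df_numerical.mean()) / df_numerical.std()

# One-hot encoding is left out here, as in the notebook cell.
df_categorical = df[categorical_cols]

# Join the two halves back together on the shared row index.
preprocessed = df_categorical.merge(
    df_numerical, left_index=True, right_index=True)
```

The notebook cell applies the same expressions to a deferred Beam DataFrame; whether `ib.collect()` then succeeds depends on the pruning behaviour discussed above.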




[GitHub] [beam] PhilippeMoussalli commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
PhilippeMoussalli commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r964647615


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. This API provides a familiar interface for machine learning practitioners to build complex data-processing pipelines using only standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to quickly test our pipeline locally before running it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we will only be working with a subset of the original dataset since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam function returns a deferred Beam DataFrame, whereas the pandas function returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, sample] where\n",
+        "# sample contains all of the dataset (around 1000000 samples)\n",
+        "file_location = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(file_location, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. Although a distance of 70,000 km might seem too large to pose any threat, it is very small on an astronomical scale, and an object passing that close can disrupt many natural phenomena. \n",
+        "\n",
+        "These objects and asteroids can therefore prove harmful, so it is wise to know what surrounds us and which objects could harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types."
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "c24ff83d-3a13-47a6-c9c2-3978729fde82"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 4
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 746
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "14fa80de-2dee-4963-99d8-3e321f949ff8"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a986c6cc61ed5a5b622e163d92f73775\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "       spk_id                   full_name near_earth_object  \\\n",
+              "0     2000001                     1 Ceres                 N   \n",
+              "1     2000002                    2 Pallas                 N   \n",
+              "2     2000003                      3 Juno                 N   \n",
+              "3     2000004                     4 Vesta                 N   \n",
+              "4     2000005                   5 Astraea                 N   \n",
+              "...       ...                         ...               ...   \n",
+              "9994  2009995    9995 Alouette (4805 P-L)                 N   \n",
+              "9995  2009996         9996 ANS (9070 P-L)                 N   \n",
+              "9996  2009997        9997 COBE (1217 T-1)                 N   \n",
+              "9997  2009998         9998 ISO (1293 T-1)                 N   \n",
+              "9998  2009999       9999 Wiles (4196 T-2)                 N   \n",
+              "\n",
+              "      absolute_magnitude  diameter  albedo  diameter_sigma  eccentricity  \\\n",
+              "0                   3.40   939.400  0.0900           0.200      0.076009   \n",
+              "1                   4.20   545.000  0.1010          18.000      0.229972   \n",
+              "2                   5.33   246.596  0.2140          10.594      0.256936   \n",
+              "3                   3.00   525.400  0.4228           0.200      0.088721   \n",
+              "4                   6.90   106.699  0.2740           3.140      0.190913   \n",
+              "...                  ...       ...     ...             ...           ...   \n",
+              "9994               15.10     2.564  0.2450           0.550      0.160610   \n",
+              "9995               13.60     8.978  0.1130           0.376      0.235174   \n",
+              "9996               14.30       NaN     NaN             NaN      0.113059   \n",
+              "9997               15.10     2.235  0.3880           0.373      0.093852   \n",
+              "9998               13.00     7.148  0.2620           0.065      0.071351   \n",
+              "\n",
+              "      inclination     moid_ld object_class  semi_major_axis_au_unit  \\\n",
+              "0       10.594067  620.640533          MBA                 2.769165   \n",
+              "1       34.832932  480.348639          MBA                 2.773841   \n",
+              "2       12.991043  402.514639          MBA                 2.668285   \n",
+              "3        7.141771  443.451432          MBA                 2.361418   \n",
+              "4        5.367427  426.433027          MBA                 2.574037   \n",
+              "...           ...         ...          ...                      ...   \n",
+              "9994     2.311731  388.723233          MBA                 2.390249   \n",
+              "9995     7.657713  444.194746          MBA                 2.796605   \n",
+              "9996     2.459643  495.460110          MBA                 2.545674   \n",
+              "9997     3.912263  373.848377          MBA                 2.160961   \n",
+              "9998     3.198839  632.144398          MBA                 2.839917   \n",
+              "\n",
+              "     hazardous_flag  \n",
+              "0                 N  \n",
+              "1                 N  \n",
+              "2                 N  \n",
+              "3                 N  \n",
+              "4                 N  \n",
+              "...             ...  \n",
+              "9994              N  \n",
+              "9995              N  \n",
+              "9996              N  \n",
+              "9997              N  \n",
+              "9998              N  \n",
+              "\n",
+              "[9999 rows x 13 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-0161aa69-d50f-4d6f-84c1-10dacb278880\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>full_name</th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>2000001</td>\n",
+              "      <td>1 Ceres</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>939.400</td>\n",
+              "      <td>0.0900</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>2000002</td>\n",
+              "      <td>2 Pallas</td>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>545.000</td>\n",
+              "      <td>0.1010</td>\n",
+              "      <td>18.000</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>2000003</td>\n",
+              "      <td>3 Juno</td>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>246.596</td>\n",
+              "      <td>0.2140</td>\n",
+              "      <td>10.594</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>2000004</td>\n",
+              "      <td>4 Vesta</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>525.400</td>\n",
+              "      <td>0.4228</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>2000005</td>\n",
+              "      <td>5 Astraea</td>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>106.699</td>\n",
+              "      <td>0.2740</td>\n",
+              "      <td>3.140</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>2009995</td>\n",
+              "      <td>9995 Alouette (4805 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.564</td>\n",
+              "      <td>0.2450</td>\n",
+              "      <td>0.550</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>2009996</td>\n",
+              "      <td>9996 ANS (9070 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>8.978</td>\n",
+              "      <td>0.1130</td>\n",
+              "      <td>0.376</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>2009997</td>\n",
+              "      <td>9997 COBE (1217 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>2009998</td>\n",
+              "      <td>9998 ISO (1293 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.235</td>\n",
+              "      <td>0.3880</td>\n",
+              "      <td>0.373</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>2009999</td>\n",
+              "      <td>9999 Wiles (4196 T-2)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>7.148</td>\n",
+              "      <td>0.2620</td>\n",
+              "      <td>0.065</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 13 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-0161aa69-d50f-4d6f-84c1-10dacb278880')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 5
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check whether all the columns can be used for model training. Let's first have a look at the column descriptions as provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the magnitude the asteroid would have if it were located 1 au from both the Sun and the observer, at zero phase angle.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km.\n",
+        "* **albedo:** the fraction of incoming solar radiation that the object reflects diffusely, measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in the object diameter, in km.\n",
+        "* **eccentricity:** value between 0 and 1 that describes how far the asteroid's orbit deviates from a circle (0 is circular, values close to 1 are strongly elongated).\n",
+        "* **inclination:** angle of the orbit with respect to the x-y ecliptic plane, in degrees.\n",
+        "* **moid_ld:** Earth Minimum Orbit Intersection Distance, in lunar distances (LD).\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** half the length of the long axis of the orbit, in au.\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. These columns can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's have a look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_02cba1067de8e024374a26297834d233\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most columns have no missing values. Columns **'diameter'**, **'albedo'** and **'diameter_sigma'** have many missing values (roughly 13 to 14% of the rows each). Since these values cannot be measured or derived, we can remove these columns; they will not be required for machine learning model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "The numerical columns need to be normalized before they are used to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation. This puts all the columns on the same scale so that they are weighted equally during training.  "

Review Comment:
   I also included a few in the references as general guidelines on data transformation.
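
For readers following the thread, the standardization described in the cell above (subtract the mean, divide by the standard deviation) can be sketched with the standard library alone. This is only an illustration under stated assumptions: `standardize` and the sample values are hypothetical and are not taken from the notebook, and the sample standard deviation (ddof=1) is used because that is the pandas default for `std()`.

```python
import statistics


def standardize(values):
    """Return z-scores: (x - mean) / standard deviation."""
    mean = statistics.fmean(values)
    # Sample standard deviation (ddof=1), matching the pandas default for std().
    std = statistics.stdev(values)
    if std == 0:
        raise ValueError("cannot standardize a constant column")
    return [(x - mean) / std for x in values]


# Hypothetical column values, chosen only for illustration.
inclination = [2.0, 4.0, 6.0, 8.0]
scaled = standardize(inclination)
print(scaled)
```

After this transformation the column has mean 0 and standard deviation 1, which is what puts differently scaled columns on an equal footing. On a DataFrame the same idea is usually written as `(df - df.mean()) / df.std()`.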





[GitHub] [beam] damccorm commented on pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
damccorm commented on PR #22587:
URL: https://github.com/apache/beam/pull/22587#issuecomment-1302233498

   Run Website PreCommit




[GitHub] [beam] davidcavazos commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
davidcavazos commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r992766337


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,2163 @@
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "source": [
+        "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License."
+      ],
+      "metadata": {
+        "id": "sARMhsXz8yR1",
+        "cellView": "form"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like\n",
+        "API to declare and define Beam processing pipelines. It provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by invoking only standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "Because we want to explore the elements within a `PCollection`, we install Apache Beam with the `interactive` component so that we can use the Interactive runner. The DataFrames API methods used in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to test our code interactively, building out the pipeline as we go before deploying it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we work with only a subset of the original dataset because we're using only the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam function returns a deferred Beam DataFrame, while pandas returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 5,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam import dataframe\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, full] where\n",
+        "# full contains all of the dataset (around 1000000 samples)\n",
+        "\n",
+        "source_csv_file = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(source_csv_file, splittable=True)\n"

Review Comment:
   Can we briefly explain why we're setting `splittable=True`, especially since it's a difference from `pandas`.
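
One way to motivate the explanation requested here: as I understand the Beam documentation, `splittable=True` lets the read be divided on newlines so that workers can process parts of a file in parallel, and it is off by default because that is only safe when no record contains a quoted newline. The sketch below uses only the standard library to show the failure mode. The CSV content is hypothetical, and the plain `split` stands in for what a newline-based splitter would see; it is not Beam's actual implementation.

```python
import csv
import io

# Hypothetical CSV content: the last record has a newline inside a quoted field.
data = 'name,notes\nCeres,"dwarf planet"\nVesta,"bright\nsurface"\n'

# Dividing the text on newlines cuts the quoted field in two.
naive_records = data.strip().split("\n")

# A full CSV parser keeps the quoted field together.
parsed_records = list(csv.reader(io.StringIO(data)))

print(len(naive_records))   # 4 pieces, one of them a broken half-record
print(len(parsed_records))  # 3 rows: header plus two records
```

If the asteroid CSV files contain no quoted newlines, enabling splitting should be safe and helps with the larger files. That is an assumption about the data that is worth stating in the notebook.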



##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,2163 @@
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "source": [
+        "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License."
+      ],
+      "metadata": {
+        "id": "sARMhsXz8yR1",
+        "cellView": "form"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like\n",
+        "API to declare and define Beam processing pipelines. It provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by invoking only standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "Because we want to explore the elements within a `PCollection`, we install Apache Beam with the `interactive` component so that we can use the Interactive runner. The DataFrames API methods used in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",

Review Comment:
   Beam 2.41 is out, so can we only keep the `pip install` option?
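
If only the `pip install` option is kept, a small guard cell could tell readers early when their environment is older than the release the notebook needs. The following is a sketch under assumptions: `meets_minimum` and `MIN_BEAM` are hypothetical names, and the check compares only the leading `major.minor` part of the version string.

```python
from importlib import metadata

# First release the notebook text says includes the needed DataFrame methods.
MIN_BEAM = (2, 41)


def meets_minimum(version, minimum=MIN_BEAM):
    """Compare the leading 'major.minor' of a version string to a minimum."""
    parts = version.split(".")
    major_minor = tuple(int(p) for p in parts[:2])
    return major_minor >= minimum


try:
    installed = metadata.version("apache-beam")
    print(installed, "ok" if meets_minimum(installed) else "too old")
except metadata.PackageNotFoundError:
    print("apache-beam is not installed")
```

Comparing tuples rather than strings avoids the usual pitfall where `"2.9"` sorts after `"2.41"` lexicographically.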



##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,2163 @@
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "source": [
+        "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License."
+      ],
+      "metadata": {
+        "id": "sARMhsXz8yR1",
+        "cellView": "form"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like\n",
+        "API to declare and define Beam processing pipelines. It provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by invoking only standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "Because we want to explore the elements within a `PCollection`, we install Apache Beam with the `interactive` component so that we can use the Interactive runner. The DataFrames API methods used in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to test our code interactively, building out the pipeline as we go before deploying it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we work with only a subset of the original dataset because we're using only the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam function returns a deferred Beam DataFrame, while pandas returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 5,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam import dataframe\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, full] where\n",
+        "# full contains all of the dataset (around 1000000 samples)\n",
+        "\n",
+        "source_csv_file = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(source_csv_file, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. A distance of 70,000 km might not seem potentially harmful, but at an astronomical scale it is very small and can disrupt many natural phenomena.\n",
+        "\n",
+        "These objects, such as asteroids, can therefore prove harmful, so it is wise to know what surrounds us and which of those objects can harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types."
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 3,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "f5386993-14cb-42ee-94ca-8ea006860d3e"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 3
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 8,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 378
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "e49b4243-107f-4256-9e09-49cc20bf7f56"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "             spk_id  absolute_magnitude     diameter       albedo  \\\n",
+              "count  9.999000e+03         9999.000000  8688.000000  8672.000000   \n",
+              "mean   2.005000e+06           12.675380    19.245446     0.197723   \n",
+              "std    2.886607e+03            1.639609    30.190191     0.138819   \n",
+              "min    2.000001e+06            3.000000     0.300000     0.008000   \n",
+              "25%    2.002500e+06           11.900000     5.614000     0.074000   \n",
+              "50%    2.005000e+06           12.900000     9.814000     0.187000   \n",
+              "75%    2.007500e+06           13.700000    19.156750     0.283000   \n",
+              "max    2.009999e+06           20.700000   939.400000     1.000000   \n",
+              "\n",
+              "       diameter_sigma  eccentricity  inclination      moid_ld  \\\n",
+              "count     8591.000000   9999.000000  9999.000000  9999.000000   \n",
+              "mean         0.454072      0.148716     7.890742   509.805237   \n",
+              "std          1.093676      0.083803     6.336244   205.046582   \n",
+              "min          0.006000      0.001003     0.042716     0.131028   \n",
+              "25%          0.120000      0.093780     3.220137   377.829197   \n",
+              "50%          0.201000      0.140335     6.018836   470.650523   \n",
+              "75%          0.375000      0.187092    10.918176   636.010802   \n",
+              "max         39.297000      0.889831    68.018875  4241.524913   \n",
+              "\n",
+              "       semi_major_axis_au_unit  \n",
+              "count              9999.000000  \n",
+              "mean                  2.689836  \n",
+              "std                   0.607190  \n",
+              "min                   0.832048  \n",
+              "25%                   2.340816  \n",
+              "50%                   2.614468  \n",
+              "75%                   3.005449  \n",
+              "max                  24.667968  "
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-2c0349a9-81c4-473a-9fa1-44c423244858\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>count</th>\n",
+              "      <td>9.999000e+03</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>8688.000000</td>\n",
+              "      <td>8672.000000</td>\n",
+              "      <td>8591.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>mean</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.675380</td>\n",
+              "      <td>19.245446</td>\n",
+              "      <td>0.197723</td>\n",
+              "      <td>0.454072</td>\n",
+              "      <td>0.148716</td>\n",
+              "      <td>7.890742</td>\n",
+              "      <td>509.805237</td>\n",
+              "      <td>2.689836</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>std</th>\n",
+              "      <td>2.886607e+03</td>\n",
+              "      <td>1.639609</td>\n",
+              "      <td>30.190191</td>\n",
+              "      <td>0.138819</td>\n",
+              "      <td>1.093676</td>\n",
+              "      <td>0.083803</td>\n",
+              "      <td>6.336244</td>\n",
+              "      <td>205.046582</td>\n",
+              "      <td>0.607190</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>min</th>\n",
+              "      <td>2.000001e+06</td>\n",
+              "      <td>3.000000</td>\n",
+              "      <td>0.300000</td>\n",
+              "      <td>0.008000</td>\n",
+              "      <td>0.006000</td>\n",
+              "      <td>0.001003</td>\n",
+              "      <td>0.042716</td>\n",
+              "      <td>0.131028</td>\n",
+              "      <td>0.832048</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>25%</th>\n",
+              "      <td>2.002500e+06</td>\n",
+              "      <td>11.900000</td>\n",
+              "      <td>5.614000</td>\n",
+              "      <td>0.074000</td>\n",
+              "      <td>0.120000</td>\n",
+              "      <td>0.093780</td>\n",
+              "      <td>3.220137</td>\n",
+              "      <td>377.829197</td>\n",
+              "      <td>2.340816</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>50%</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.900000</td>\n",
+              "      <td>9.814000</td>\n",
+              "      <td>0.187000</td>\n",
+              "      <td>0.201000</td>\n",
+              "      <td>0.140335</td>\n",
+              "      <td>6.018836</td>\n",
+              "      <td>470.650523</td>\n",
+              "      <td>2.614468</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>75%</th>\n",
+              "      <td>2.007500e+06</td>\n",
+              "      <td>13.700000</td>\n",
+              "      <td>19.156750</td>\n",
+              "      <td>0.283000</td>\n",
+              "      <td>0.375000</td>\n",
+              "      <td>0.187092</td>\n",
+              "      <td>10.918176</td>\n",
+              "      <td>636.010802</td>\n",
+              "      <td>3.005449</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>max</th>\n",
+              "      <td>2.009999e+06</td>\n",
+              "      <td>20.700000</td>\n",
+              "      <td>939.400000</td>\n",
+              "      <td>1.000000</td>\n",
+              "      <td>39.297000</td>\n",
+              "      <td>0.889831</td>\n",
+              "      <td>68.018875</td>\n",
+              "      <td>4241.524913</td>\n",
+              "      <td>24.667968</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-2c0349a9-81c4-473a-9fa1-44c423244858')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-2c0349a9-81c4-473a-9fa1-44c423244858 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-2c0349a9-81c4-473a-9fa1-44c423244858');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://developers.google.com/machine-learning/data-prep/transform/normalization) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://developers.google.com/machine-learning/data-prep/transform/transform-categorical) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can also use the standard pandas command `DataFrame.describe()` to generate descriptive statistics for the numerical columns, such as percentiles, mean, and standard deviation."
+      ],
+      "metadata": {
+        "id": "MGAErO0lAYws"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "with dataframe.allow_non_parallel_operations():\n",
+        "  beam_df_description = ib.collect(beam_df.describe())\n",
+        "\n",
+        "beam_df_description"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 378
+        },
+        "id": "Befv697VBGM7",
+        "outputId": "d02b7a41-a8a3-4837-cf63-e1fa9e7b011e"
+      },
+      "execution_count": 9,
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a5b31481d153dff1b7ecdd673624949b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a5b31481d153dff1b7ecdd673624949b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a5b31481d153dff1b7ecdd673624949b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "             spk_id  absolute_magnitude     diameter       albedo  \\\n",
+              "count  9.999000e+03         9999.000000  8688.000000  8672.000000   \n",
+              "mean   2.005000e+06           12.675380    19.245446     0.197723   \n",
+              "std    2.886607e+03            1.639609    30.190191     0.138819   \n",
+              "min    2.000001e+06            3.000000     0.300000     0.008000   \n",
+              "25%    2.002500e+06           11.900000     5.614000     0.074000   \n",
+              "50%    2.005000e+06           12.900000     9.814000     0.187000   \n",
+              "75%    2.007500e+06           13.700000    19.156750     0.283000   \n",
+              "max    2.009999e+06           20.700000   939.400000     1.000000   \n",
+              "\n",
+              "       diameter_sigma  eccentricity  inclination      moid_ld  \\\n",
+              "count     8591.000000   9999.000000  9999.000000  9999.000000   \n",
+              "mean         0.454072      0.148716     7.890742   509.805237   \n",
+              "std          1.093676      0.083803     6.336244   205.046582   \n",
+              "min          0.006000      0.001003     0.042716     0.131028   \n",
+              "25%          0.120000      0.093780     3.220137   377.829197   \n",
+              "50%          0.201000      0.140335     6.018836   470.650523   \n",
+              "75%          0.375000      0.187092    10.918176   636.010802   \n",
+              "max         39.297000      0.889831    68.018875  4241.524913   \n",
+              "\n",
+              "       semi_major_axis_au_unit  \n",
+              "count              9999.000000  \n",
+              "mean                  2.689836  \n",
+              "std                   0.607190  \n",
+              "min                   0.832048  \n",
+              "25%                   2.340816  \n",
+              "50%                   2.614468  \n",
+              "75%                   3.005449  \n",
+              "max                  24.667968  "
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-d16cf806-a3e2-46d9-973d-74448570aaa2\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>count</th>\n",
+              "      <td>9.999000e+03</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>8688.000000</td>\n",
+              "      <td>8672.000000</td>\n",
+              "      <td>8591.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>mean</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.675380</td>\n",
+              "      <td>19.245446</td>\n",
+              "      <td>0.197723</td>\n",
+              "      <td>0.454072</td>\n",
+              "      <td>0.148716</td>\n",
+              "      <td>7.890742</td>\n",
+              "      <td>509.805237</td>\n",
+              "      <td>2.689836</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>std</th>\n",
+              "      <td>2.886607e+03</td>\n",
+              "      <td>1.639609</td>\n",
+              "      <td>30.190191</td>\n",
+              "      <td>0.138819</td>\n",
+              "      <td>1.093676</td>\n",
+              "      <td>0.083803</td>\n",
+              "      <td>6.336244</td>\n",
+              "      <td>205.046582</td>\n",
+              "      <td>0.607190</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>min</th>\n",
+              "      <td>2.000001e+06</td>\n",
+              "      <td>3.000000</td>\n",
+              "      <td>0.300000</td>\n",
+              "      <td>0.008000</td>\n",
+              "      <td>0.006000</td>\n",
+              "      <td>0.001003</td>\n",
+              "      <td>0.042716</td>\n",
+              "      <td>0.131028</td>\n",
+              "      <td>0.832048</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>25%</th>\n",
+              "      <td>2.002500e+06</td>\n",
+              "      <td>11.900000</td>\n",
+              "      <td>5.614000</td>\n",
+              "      <td>0.074000</td>\n",
+              "      <td>0.120000</td>\n",
+              "      <td>0.093780</td>\n",
+              "      <td>3.220137</td>\n",
+              "      <td>377.829197</td>\n",
+              "      <td>2.340816</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>50%</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.900000</td>\n",
+              "      <td>9.814000</td>\n",
+              "      <td>0.187000</td>\n",
+              "      <td>0.201000</td>\n",
+              "      <td>0.140335</td>\n",
+              "      <td>6.018836</td>\n",
+              "      <td>470.650523</td>\n",
+              "      <td>2.614468</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>75%</th>\n",
+              "      <td>2.007500e+06</td>\n",
+              "      <td>13.700000</td>\n",
+              "      <td>19.156750</td>\n",
+              "      <td>0.283000</td>\n",
+              "      <td>0.375000</td>\n",
+              "      <td>0.187092</td>\n",
+              "      <td>10.918176</td>\n",
+              "      <td>636.010802</td>\n",
+              "      <td>3.005449</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>max</th>\n",
+              "      <td>2.009999e+06</td>\n",
+              "      <td>20.700000</td>\n",
+              "      <td>939.400000</td>\n",
+              "      <td>1.000000</td>\n",
+              "      <td>39.297000</td>\n",
+              "      <td>0.889831</td>\n",
+              "      <td>68.018875</td>\n",
+              "      <td>4241.524913</td>\n",
+              "      <td>24.667968</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-d16cf806-a3e2-46d9-973d-74448570aaa2')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-d16cf806-a3e2-46d9-973d-74448570aaa2 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-d16cf806-a3e2-46d9-973d-74448570aaa2');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 9
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check whether all the columns are needed for model training. Let's first look at the column descriptions provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the apparent magnitude the object would have if it were located 1 au from both the Sun and the observer, at zero phase angle.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km.\n",
+        "* **albedo:** the fraction of incoming solar radiation that is diffusely reflected, measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in object diameter, in km.\n",
+        "* **eccentricity:** value between 0 and 1 that describes how much the asteroid's orbit deviates from a perfect circle (0 is circular).\n",
+        "* **inclination:** angle of the orbit with respect to the x-y ecliptic plane.\n",
+        "* **moid_ld:** Earth minimum orbit intersection distance (MOID), in lunar distances (LD).\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** the length of half of the long axis of the orbit, in au.\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. These columns can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's have a look at the number of missing values."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_02cba1067de8e024374a26297834d233\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "It can be observed that most of the columns do not have missing values. However, columns **'diameter'**, **'albedo'** and **'diameter_sigma'** have many missing values. Because these values cannot be measured or derived, and the columns are not required for training the machine learning model, we can remove these columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "Next, we need to normalize the numerical columns before using them to train a model. A common method of standardization is to subtract the mean and divide by standard deviation (a.k.a. [z-score](https://developers.google.com/machine-learning/data-prep/transform/normalization#z-score)). This improves the performance and trainign stability of the model during training and inference.\n"

Review Comment:
   Typo: `trainign` -> `training`
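
For reference, the z-score step that this cell describes (subtract the mean, divide by the standard deviation) can be sketched with the Python standard library alone. This is only an illustration under stated assumptions, not the notebook's code: the helper name `z_score` and the sample values are made up, and it uses the sample standard deviation, which matches the default of pandas `std()`.

```python
from statistics import mean, stdev


def z_score(values):
    """Return (x - mean) / std for every x in values.

    Hypothetical helper for illustration only; it is not part of the
    notebook or of the Beam DataFrame API.
    """
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (ddof=1)
    if sigma == 0:
        # A constant column has no spread, so the z-score is undefined.
        raise ValueError("cannot normalize a column with zero variance")
    return [(x - mu) / sigma for x in values]


# Made-up sample values: mean is 4.0 and sample standard deviation is 2.0.
print(z_score([2.0, 4.0, 6.0]))
```

After this transformation the column has zero mean and unit sample standard deviation, so numerical columns measured on very different scales become comparable.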



##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,2163 @@
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "source": [
+        "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License."
+      ],
+      "metadata": {
+        "id": "sARMhsXz8yR1",
+        "cellView": "form"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like\n",
+        "API to declare and define Beam processing pipelines. It provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by only invoking standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "As we want to explore the elements within a `PCollection`, we can make use of the Interactive runner by installing Apache Beam with the `interactive` component. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to test our code interactively, building out the pipeline as we go before deploying it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we will only be working with a subset of the original dataset since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam method returns a deferred Beam DataFrame while pandas returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 5,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam import dataframe\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, full], where\n",
+        "# full contains all of the dataset (around 1000000 samples)\n",
+        "\n",
+        "source_csv_file = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(source_csv_file, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. A distance of 70,000 km might not seem potentially harmful, but at an astronomical scale this is a very small distance and can disrupt many natural phenomena. \n",
+        "\n",
+        "These objects/asteroids can thus prove to be harmful, so it is wise to know what surrounds us and which of these objects can harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest-Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types"
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 3,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "f5386993-14cb-42ee-94ca-8ea006860d3e"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 3
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 8,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 378
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "e49b4243-107f-4256-9e09-49cc20bf7f56"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_f6af66571c53daa0d9052370b7d1d8b7\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_f6af66571c53daa0d9052370b7d1d8b7\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_f6af66571c53daa0d9052370b7d1d8b7\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "             spk_id  absolute_magnitude     diameter       albedo  \\\n",
+              "count  9.999000e+03         9999.000000  8688.000000  8672.000000   \n",
+              "mean   2.005000e+06           12.675380    19.245446     0.197723   \n",
+              "std    2.886607e+03            1.639609    30.190191     0.138819   \n",
+              "min    2.000001e+06            3.000000     0.300000     0.008000   \n",
+              "25%    2.002500e+06           11.900000     5.614000     0.074000   \n",
+              "50%    2.005000e+06           12.900000     9.814000     0.187000   \n",
+              "75%    2.007500e+06           13.700000    19.156750     0.283000   \n",
+              "max    2.009999e+06           20.700000   939.400000     1.000000   \n",
+              "\n",
+              "       diameter_sigma  eccentricity  inclination      moid_ld  \\\n",
+              "count     8591.000000   9999.000000  9999.000000  9999.000000   \n",
+              "mean         0.454072      0.148716     7.890742   509.805237   \n",
+              "std          1.093676      0.083803     6.336244   205.046582   \n",
+              "min          0.006000      0.001003     0.042716     0.131028   \n",
+              "25%          0.120000      0.093780     3.220137   377.829197   \n",
+              "50%          0.201000      0.140335     6.018836   470.650523   \n",
+              "75%          0.375000      0.187092    10.918176   636.010802   \n",
+              "max         39.297000      0.889831    68.018875  4241.524913   \n",
+              "\n",
+              "       semi_major_axis_au_unit  \n",
+              "count              9999.000000  \n",
+              "mean                  2.689836  \n",
+              "std                   0.607190  \n",
+              "min                   0.832048  \n",
+              "25%                   2.340816  \n",
+              "50%                   2.614468  \n",
+              "75%                   3.005449  \n",
+              "max                  24.667968  "
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-2c0349a9-81c4-473a-9fa1-44c423244858\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>count</th>\n",
+              "      <td>9.999000e+03</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>8688.000000</td>\n",
+              "      <td>8672.000000</td>\n",
+              "      <td>8591.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>mean</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.675380</td>\n",
+              "      <td>19.245446</td>\n",
+              "      <td>0.197723</td>\n",
+              "      <td>0.454072</td>\n",
+              "      <td>0.148716</td>\n",
+              "      <td>7.890742</td>\n",
+              "      <td>509.805237</td>\n",
+              "      <td>2.689836</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>std</th>\n",
+              "      <td>2.886607e+03</td>\n",
+              "      <td>1.639609</td>\n",
+              "      <td>30.190191</td>\n",
+              "      <td>0.138819</td>\n",
+              "      <td>1.093676</td>\n",
+              "      <td>0.083803</td>\n",
+              "      <td>6.336244</td>\n",
+              "      <td>205.046582</td>\n",
+              "      <td>0.607190</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>min</th>\n",
+              "      <td>2.000001e+06</td>\n",
+              "      <td>3.000000</td>\n",
+              "      <td>0.300000</td>\n",
+              "      <td>0.008000</td>\n",
+              "      <td>0.006000</td>\n",
+              "      <td>0.001003</td>\n",
+              "      <td>0.042716</td>\n",
+              "      <td>0.131028</td>\n",
+              "      <td>0.832048</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>25%</th>\n",
+              "      <td>2.002500e+06</td>\n",
+              "      <td>11.900000</td>\n",
+              "      <td>5.614000</td>\n",
+              "      <td>0.074000</td>\n",
+              "      <td>0.120000</td>\n",
+              "      <td>0.093780</td>\n",
+              "      <td>3.220137</td>\n",
+              "      <td>377.829197</td>\n",
+              "      <td>2.340816</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>50%</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.900000</td>\n",
+              "      <td>9.814000</td>\n",
+              "      <td>0.187000</td>\n",
+              "      <td>0.201000</td>\n",
+              "      <td>0.140335</td>\n",
+              "      <td>6.018836</td>\n",
+              "      <td>470.650523</td>\n",
+              "      <td>2.614468</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>75%</th>\n",
+              "      <td>2.007500e+06</td>\n",
+              "      <td>13.700000</td>\n",
+              "      <td>19.156750</td>\n",
+              "      <td>0.283000</td>\n",
+              "      <td>0.375000</td>\n",
+              "      <td>0.187092</td>\n",
+              "      <td>10.918176</td>\n",
+              "      <td>636.010802</td>\n",
+              "      <td>3.005449</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>max</th>\n",
+              "      <td>2.009999e+06</td>\n",
+              "      <td>20.700000</td>\n",
+              "      <td>939.400000</td>\n",
+              "      <td>1.000000</td>\n",
+              "      <td>39.297000</td>\n",
+              "      <td>0.889831</td>\n",
+              "      <td>68.018875</td>\n",
+              "      <td>4241.524913</td>\n",
+              "      <td>24.667968</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-2c0349a9-81c4-473a-9fa1-44c423244858')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-2c0349a9-81c4-473a-9fa1-44c423244858 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-2c0349a9-81c4-473a-9fa1-44c423244858');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://developers.google.com/machine-learning/data-prep/transform/normalization) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://developers.google.com/machine-learning/data-prep/transform/transform-categorical) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can also use the standard pandas command `DataFrame.describe()` to generate descriptive statistics for the numerical columns, such as percentiles, mean, and standard deviation."
+      ],
+      "metadata": {
+        "id": "MGAErO0lAYws"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "with dataframe.allow_non_parallel_operations():\n",
+        "  beam_df_description = ib.collect(beam_df.describe())\n",
+        "\n",
+        "beam_df_description"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 378
+        },
+        "id": "Befv697VBGM7",
+        "outputId": "d02b7a41-a8a3-4837-cf63-e1fa9e7b011e"
+      },
+      "execution_count": 9,
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a5b31481d153dff1b7ecdd673624949b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a5b31481d153dff1b7ecdd673624949b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a5b31481d153dff1b7ecdd673624949b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "             spk_id  absolute_magnitude     diameter       albedo  \\\n",
+              "count  9.999000e+03         9999.000000  8688.000000  8672.000000   \n",
+              "mean   2.005000e+06           12.675380    19.245446     0.197723   \n",
+              "std    2.886607e+03            1.639609    30.190191     0.138819   \n",
+              "min    2.000001e+06            3.000000     0.300000     0.008000   \n",
+              "25%    2.002500e+06           11.900000     5.614000     0.074000   \n",
+              "50%    2.005000e+06           12.900000     9.814000     0.187000   \n",
+              "75%    2.007500e+06           13.700000    19.156750     0.283000   \n",
+              "max    2.009999e+06           20.700000   939.400000     1.000000   \n",
+              "\n",
+              "       diameter_sigma  eccentricity  inclination      moid_ld  \\\n",
+              "count     8591.000000   9999.000000  9999.000000  9999.000000   \n",
+              "mean         0.454072      0.148716     7.890742   509.805237   \n",
+              "std          1.093676      0.083803     6.336244   205.046582   \n",
+              "min          0.006000      0.001003     0.042716     0.131028   \n",
+              "25%          0.120000      0.093780     3.220137   377.829197   \n",
+              "50%          0.201000      0.140335     6.018836   470.650523   \n",
+              "75%          0.375000      0.187092    10.918176   636.010802   \n",
+              "max         39.297000      0.889831    68.018875  4241.524913   \n",
+              "\n",
+              "       semi_major_axis_au_unit  \n",
+              "count              9999.000000  \n",
+              "mean                  2.689836  \n",
+              "std                   0.607190  \n",
+              "min                   0.832048  \n",
+              "25%                   2.340816  \n",
+              "50%                   2.614468  \n",
+              "75%                   3.005449  \n",
+              "max                  24.667968  "
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-d16cf806-a3e2-46d9-973d-74448570aaa2\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>count</th>\n",
+              "      <td>9.999000e+03</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>8688.000000</td>\n",
+              "      <td>8672.000000</td>\n",
+              "      <td>8591.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>mean</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.675380</td>\n",
+              "      <td>19.245446</td>\n",
+              "      <td>0.197723</td>\n",
+              "      <td>0.454072</td>\n",
+              "      <td>0.148716</td>\n",
+              "      <td>7.890742</td>\n",
+              "      <td>509.805237</td>\n",
+              "      <td>2.689836</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>std</th>\n",
+              "      <td>2.886607e+03</td>\n",
+              "      <td>1.639609</td>\n",
+              "      <td>30.190191</td>\n",
+              "      <td>0.138819</td>\n",
+              "      <td>1.093676</td>\n",
+              "      <td>0.083803</td>\n",
+              "      <td>6.336244</td>\n",
+              "      <td>205.046582</td>\n",
+              "      <td>0.607190</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>min</th>\n",
+              "      <td>2.000001e+06</td>\n",
+              "      <td>3.000000</td>\n",
+              "      <td>0.300000</td>\n",
+              "      <td>0.008000</td>\n",
+              "      <td>0.006000</td>\n",
+              "      <td>0.001003</td>\n",
+              "      <td>0.042716</td>\n",
+              "      <td>0.131028</td>\n",
+              "      <td>0.832048</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>25%</th>\n",
+              "      <td>2.002500e+06</td>\n",
+              "      <td>11.900000</td>\n",
+              "      <td>5.614000</td>\n",
+              "      <td>0.074000</td>\n",
+              "      <td>0.120000</td>\n",
+              "      <td>0.093780</td>\n",
+              "      <td>3.220137</td>\n",
+              "      <td>377.829197</td>\n",
+              "      <td>2.340816</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>50%</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.900000</td>\n",
+              "      <td>9.814000</td>\n",
+              "      <td>0.187000</td>\n",
+              "      <td>0.201000</td>\n",
+              "      <td>0.140335</td>\n",
+              "      <td>6.018836</td>\n",
+              "      <td>470.650523</td>\n",
+              "      <td>2.614468</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>75%</th>\n",
+              "      <td>2.007500e+06</td>\n",
+              "      <td>13.700000</td>\n",
+              "      <td>19.156750</td>\n",
+              "      <td>0.283000</td>\n",
+              "      <td>0.375000</td>\n",
+              "      <td>0.187092</td>\n",
+              "      <td>10.918176</td>\n",
+              "      <td>636.010802</td>\n",
+              "      <td>3.005449</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>max</th>\n",
+              "      <td>2.009999e+06</td>\n",
+              "      <td>20.700000</td>\n",
+              "      <td>939.400000</td>\n",
+              "      <td>1.000000</td>\n",
+              "      <td>39.297000</td>\n",
+              "      <td>0.889831</td>\n",
+              "      <td>68.018875</td>\n",
+              "      <td>4241.524913</td>\n",
+              "      <td>24.667968</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-d16cf806-a3e2-46d9-973d-74448570aaa2')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-d16cf806-a3e2-46d9-973d-74448570aaa2 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-d16cf806-a3e2-46d9-973d-74448570aaa2');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 9
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check whether all the columns are needed for model training. Let's first look at the column descriptions provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the apparent magnitude the asteroid would have if it were 1 au from both the Sun and the observer at zero phase angle.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km.\n",
+        "* **albedo:** the fraction of incoming solar radiation that is diffusely reflected, measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in object diameter, in km.\n",
+        "* **eccentricity:** value between 0 and 1 that referes to how flat or round the shape of the asteroid is  \n",
+        "* **inclination:** angle with respect to x-y ecliptic plane\n",
+        "* **moid_ld:** Earth minimum orbit intersection distance (MOID), in lunar distances (LD)\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** the length of half of the long axis of the orbit, in au\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spkid'** and **'full_name'** are unique for each row.  These columns can be removed since they are not needed for model training."

Review Comment:
   Typo: `spkid` -> `spk_id`
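
For illustration only, here is a small plain-pandas sketch of why the spelling matters. The frame below is made up and only borrows three of the notebook's column names; it is not the notebook's Beam DataFrame, and deferred Beam DataFrames may surface the error differently.

```python
import pandas as pd

# Tiny made-up frame that only mimics three of the notebook's column names.
df = pd.DataFrame({
    "spk_id": [2000001, 2000002],
    "full_name": ["1 Ceres", "2 Pallas"],
    "diameter": [939.4, 545.0],
})

# Dropping by the real column names works.
dropped = df.drop(["spk_id", "full_name"], axis="columns")
print(list(dropped.columns))

# The misspelled name is not a column, so pandas raises a KeyError.
try:
    df.drop(["spkid", "full_name"], axis="columns")
    misspelled_ok = True
except KeyError:
    misspelled_ok = False
print(misspelled_ok)
```

Keeping the markdown text in sync with the name used in the `drop` call avoids readers copying the wrong identifier.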



##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. They provide a familiar interface for machine learning practitioners to build complex data-processing pipelines by only invoking standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to quickly test our pipeline locally before running it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we will only be working with a subset of the original dataset since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam function returns a deferred Beam DataFrame while pandas returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, sample] where\n",
+        "# sample contains all of the dataset (around 1000000 samples)\n",
+        "file_location = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize the pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(file_location, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "Outer space contains innumerable objects, and some of them are closer than we think. A distance of 70,000 km might not seem potentially harmful, but at an astronomical scale it is very small and can disrupt many natural phenomena.\n",
+        "\n",
+        "These objects/asteroids can thus prove to be harmful, so it is wise to know what surrounds us and which of these objects can harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest-Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types"
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "c24ff83d-3a13-47a6-c9c2-3978729fde82"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 4
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a Pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 746
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "14fa80de-2dee-4963-99d8-3e321f949ff8"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a986c6cc61ed5a5b622e163d92f73775\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "       spk_id                   full_name near_earth_object  \\\n",
+              "0     2000001                     1 Ceres                 N   \n",
+              "1     2000002                    2 Pallas                 N   \n",
+              "2     2000003                      3 Juno                 N   \n",
+              "3     2000004                     4 Vesta                 N   \n",
+              "4     2000005                   5 Astraea                 N   \n",
+              "...       ...                         ...               ...   \n",
+              "9994  2009995    9995 Alouette (4805 P-L)                 N   \n",
+              "9995  2009996         9996 ANS (9070 P-L)                 N   \n",
+              "9996  2009997        9997 COBE (1217 T-1)                 N   \n",
+              "9997  2009998         9998 ISO (1293 T-1)                 N   \n",
+              "9998  2009999       9999 Wiles (4196 T-2)                 N   \n",
+              "\n",
+              "      absolute_magnitude  diameter  albedo  diameter_sigma  eccentricity  \\\n",
+              "0                   3.40   939.400  0.0900           0.200      0.076009   \n",
+              "1                   4.20   545.000  0.1010          18.000      0.229972   \n",
+              "2                   5.33   246.596  0.2140          10.594      0.256936   \n",
+              "3                   3.00   525.400  0.4228           0.200      0.088721   \n",
+              "4                   6.90   106.699  0.2740           3.140      0.190913   \n",
+              "...                  ...       ...     ...             ...           ...   \n",
+              "9994               15.10     2.564  0.2450           0.550      0.160610   \n",
+              "9995               13.60     8.978  0.1130           0.376      0.235174   \n",
+              "9996               14.30       NaN     NaN             NaN      0.113059   \n",
+              "9997               15.10     2.235  0.3880           0.373      0.093852   \n",
+              "9998               13.00     7.148  0.2620           0.065      0.071351   \n",
+              "\n",
+              "      inclination     moid_ld object_class  semi_major_axis_au_unit  \\\n",
+              "0       10.594067  620.640533          MBA                 2.769165   \n",
+              "1       34.832932  480.348639          MBA                 2.773841   \n",
+              "2       12.991043  402.514639          MBA                 2.668285   \n",
+              "3        7.141771  443.451432          MBA                 2.361418   \n",
+              "4        5.367427  426.433027          MBA                 2.574037   \n",
+              "...           ...         ...          ...                      ...   \n",
+              "9994     2.311731  388.723233          MBA                 2.390249   \n",
+              "9995     7.657713  444.194746          MBA                 2.796605   \n",
+              "9996     2.459643  495.460110          MBA                 2.545674   \n",
+              "9997     3.912263  373.848377          MBA                 2.160961   \n",
+              "9998     3.198839  632.144398          MBA                 2.839917   \n",
+              "\n",
+              "     hazardous_flag  \n",
+              "0                 N  \n",
+              "1                 N  \n",
+              "2                 N  \n",
+              "3                 N  \n",
+              "4                 N  \n",
+              "...             ...  \n",
+              "9994              N  \n",
+              "9995              N  \n",
+              "9996              N  \n",
+              "9997              N  \n",
+              "9998              N  \n",
+              "\n",
+              "[9999 rows x 13 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-0161aa69-d50f-4d6f-84c1-10dacb278880\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>full_name</th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>2000001</td>\n",
+              "      <td>1 Ceres</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>939.400</td>\n",
+              "      <td>0.0900</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>2000002</td>\n",
+              "      <td>2 Pallas</td>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>545.000</td>\n",
+              "      <td>0.1010</td>\n",
+              "      <td>18.000</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>2000003</td>\n",
+              "      <td>3 Juno</td>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>246.596</td>\n",
+              "      <td>0.2140</td>\n",
+              "      <td>10.594</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>2000004</td>\n",
+              "      <td>4 Vesta</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>525.400</td>\n",
+              "      <td>0.4228</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>2000005</td>\n",
+              "      <td>5 Astraea</td>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>106.699</td>\n",
+              "      <td>0.2740</td>\n",
+              "      <td>3.140</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>2009995</td>\n",
+              "      <td>9995 Alouette (4805 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.564</td>\n",
+              "      <td>0.2450</td>\n",
+              "      <td>0.550</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>2009996</td>\n",
+              "      <td>9996 ANS (9070 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>8.978</td>\n",
+              "      <td>0.1130</td>\n",
+              "      <td>0.376</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>2009997</td>\n",
+              "      <td>9997 COBE (1217 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>2009998</td>\n",
+              "      <td>9998 ISO (1293 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.235</td>\n",
+              "      <td>0.3880</td>\n",
+              "      <td>0.373</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>2009999</td>\n",
+              "      <td>9999 Wiles (4196 T-2)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>7.148</td>\n",
+              "      <td>0.2620</td>\n",
+              "      <td>0.065</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 13 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-0161aa69-d50f-4d6f-84c1-10dacb278880')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 5
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check if all the columns can be used for model training. Let's first have a look at the column description as provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the visual magnitude the asteroid would have if it were observed at a distance of 1 au from both the Sun and the observer, at zero phase angle.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km.\n",
+        "* **albedo:** a measure of the diffuse reflection of solar radiation out of the total solar radiation and measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in object diameter, in km.\n",
+        "* **eccentricity:** value between 0 and 1 that describes how far the asteroid's orbit deviates from a circle (0 is a circular orbit)\n",
+        "* **inclination:** angle with respect to x-y ecliptic plane\n",
+        "* **moid_ld:** Earth Minimum Orbit Intersection Distance au Unit\n",
+        "* **object_class:** the classification of the asteroid. Checkout this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **Semi-major axis au Unit:** the length of half of the long axis in AU unit\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. These columns can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_02cba1067de8e024374a26297834d233\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most columns have no missing values. Columns **'diameter'**, **'albedo'** and **'diameter_sigma'** each have more than 13% missing values. Since these missing values cannot be measured or derived, we remove these columns; they are not required for training the machine learning model."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "The numerical columns need to be normalized before using them to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation. This ensures that all columns are on the same scale and are weighted equally during training."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical columns and the categorical columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 356
+        },
+        "id": "Gjc0UlDD-xUn",
+        "outputId": "cadc4402-7edc-43f6-bc6a-bc4f1fd3314a"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "error",
+          "ename": "NotImplementedError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mNotImplementedError\u001b[0m                       Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/987840581.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Normalizing method_1: Can work but relies on ticket\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frame_base.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    426\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    427\u001b[0m     raise NotImplementedError(\n\u001b[0;32m--> 428\u001b[0;31m         \u001b[0;34mf\"{op!r} is not implemented yet. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    429\u001b[0m         \u001b[0;34mf\"If support for {op!r} is important to you, please let the Beam \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    430\u001b[0m         \u001b[0;34m\"community know by writing to user@beam.apache.org \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mNotImplementedError\u001b[0m: 'loc.setitem' is not implemented yet. If support for 'loc.setitem' is important to you, please let the Beam community know by writing to user@beam.apache.org (see https://beam.apache.org/community/contact-us/) or commenting on https://github.com/apache/beam/issues/20318"
+          ]
+        }
+      ],
+      "source": [
+        "# Normalizing method_1: Can work but relies on ticket #22267\n",
+        "beam_df.loc[:, numerical_cols] = (beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean()) / beam_df.loc[:, numerical_cols].std()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
+        "Normalizing the data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "be40308f-d27c-46fa-e365-43a635addf6b"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>306</th>\n",
+              "      <td>-1.570727</td>\n",
+              "      <td>-0.062543</td>\n",
+              "      <td>-0.278518</td>\n",
+              "      <td>0.373194</td>\n",
+              "      <td>0.357201</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>310</th>\n",
+              "      <td>-1.631718</td>\n",
+              "      <td>-1.724526</td>\n",
+              "      <td>-0.736389</td>\n",
+              "      <td>1.087833</td>\n",
+              "      <td>0.344233</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>546</th>\n",
+              "      <td>-1.753698</td>\n",
+              "      <td>1.028793</td>\n",
+              "      <td>1.415303</td>\n",
+              "      <td>-0.339489</td>\n",
+              "      <td>0.139080</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>635</th>\n",
+              "      <td>-1.875678</td>\n",
+              "      <td>0.244869</td>\n",
+              "      <td>0.005905</td>\n",
+              "      <td>0.214107</td>\n",
+              "      <td>0.367559</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>701</th>\n",
+              "      <td>-3.278451</td>\n",
+              "      <td>-1.570523</td>\n",
+              "      <td>2.006145</td>\n",
+              "      <td>1.542754</td>\n",
+              "      <td>0.829337</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9697</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-1.151809</td>\n",
+              "      <td>-0.082944</td>\n",
+              "      <td>-0.129556</td>\n",
+              "      <td>-0.533538</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9813</th>\n",
+              "      <td>1.722740</td>\n",
+              "      <td>0.844551</td>\n",
+              "      <td>-0.583247</td>\n",
+              "      <td>-1.006447</td>\n",
+              "      <td>-0.677961</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9868</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-0.207399</td>\n",
+              "      <td>-0.784665</td>\n",
+              "      <td>-0.462136</td>\n",
+              "      <td>-0.539794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9903</th>\n",
+              "      <td>0.868878</td>\n",
+              "      <td>0.460086</td>\n",
+              "      <td>0.092258</td>\n",
+              "      <td>-0.107597</td>\n",
+              "      <td>0.071794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9956</th>\n",
+              "      <td>0.746898</td>\n",
+              "      <td>-0.234132</td>\n",
+              "      <td>-0.161116</td>\n",
+              "      <td>-0.601379</td>\n",
+              "      <td>-0.664887</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 5 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_numerical_cols = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize the DataFrame that contains only the numerical columns\n",
+        "beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean())/beam_numerical_cols.std()\n",
+        "\n",
+        "ib.collect(beam_numerical_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to convert the categorical columns into one-hot encoded variables so that we can use them during training.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "Ngoxg0rSywVd",
+        "outputId": "d81bb29a-f8f8-4186-b186-5ff85667dbec"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1671644751.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'object_class'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'get_dummies'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col= beam_df.filter(items=['object_class'])\n",
+        "object_class_col.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "hz8s7z8caTq-",
+        "outputId": "5c543a6b-0ea1-41f8-afec-9691bbbd1f5b"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1927971370.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;31m# df['categories_concat'].str.get_dummies('-')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'str'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col.str.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "rVdSIyCB0spw"
+      },
+      "source": [
+        "# Putting it all together\n",
+        "\n",
+        "Let's now combine all the steps that we executed above into a full pipeline implementation and visualize our pre-processed data.\n",
+        "\n",
+        "> ℹ️ Note that the only standard Beam construct invoked here is the `Pipeline` instance. The rest of the pre-processing commands are all based on native pandas methods that have been integrated with the Beam DataFrame API. "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "ndaSNond0v8Q",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 498
+        },
+        "outputId": "0155d359-45c9-4345-e1b6-b1881408f049"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "error",
+          "ename": "ValueError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1408061827.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m     25\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     26\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 27\u001b[0;31m \u001b[0mib\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcollect\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpreprocessed_dataset\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/utils.py\u001b[0m in \u001b[0;36mrun_within_progress_indicator\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    275\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mrun_within_progress_indicator\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    276\u001b[0m     \u001b[0;32mwith\u001b[0m \u001b[0mProgressIndicator\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34mf'Processing... {func.__name__}'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Done.'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 277\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001
 b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    278\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    279\u001b[0m   \u001b[0;32mreturn\u001b[0m \u001b[0mrun_within_progress_indicator\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/interactive_beam.py\u001b[0m in \u001b[0;36mcollect\u001b[0;34m(pcoll, n, duration, include_window_info)\u001b[0m\n\u001b[1;32m    945\u001b[0m         element_type=element_type)\n\u001b[1;32m    946\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 947\u001b[0;31m   \u001b[0mrecording\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mrecording_manager\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrecord\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mpcoll\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmax_n\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mn\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmax_duration\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mduration\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    948\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    949\u001b[0m   \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0
 ;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/recording_manager.py\u001b[0m in \u001b[0;36mrecord\u001b[0;34m(self, pcolls, max_n, max_duration)\u001b[0m\n\u001b[1;32m    459\u001b[0m       pf.PipelineFragment(\n\u001b[1;32m    460\u001b[0m           \u001b[0mlist\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0muncomputed_pcolls\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 461\u001b[0;31m           self.user_pipeline.options).run(blocking=is_remote_run)\n\u001b[0m\u001b[1;32m    462\u001b[0m       \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mie\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcurrent_env\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpipeline_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0muser_pipeline\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n
 \u001b[1;32m    463\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py\u001b[0m in \u001b[0;36mrun\u001b[0;34m(self, display_pipeline_graph, use_cache, blocking)\u001b[0m\n\u001b[1;32m    111\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_force_compute\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0muse_cache\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    112\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_blocking\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mblocking\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 113\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdeduce_fragment\u001
 b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    114\u001b[0m     \u001b[0;32mfinally\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    115\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_skip_display\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpreserved_skip_display\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py\u001b[0m in \u001b[0;36mdeduce_fragment\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m     98\u001b[0m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mto_runner_api\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     99\u001b[0m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 100\u001b[0;31m         self._options)\n\u001b[0m\u001b[1;32m    101\u001b[0m     \u001b[0mie\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcurrent_env\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0madd_derived_pipeline\u001b[0m\u001b[0;34m(\u001b[0m\u001b[
 0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfragment\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    102\u001b[0m     \u001b[0;32mreturn\u001b[0m \u001b[0mfragment\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/pipeline.py\u001b[0m in \u001b[0;36mfrom_runner_api\u001b[0;34m(proto, runner, options, return_context)\u001b[0m\n\u001b[1;32m    990\u001b[0m       \u001b[0mpcollection\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpipeline\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mp\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    991\u001b[0m       \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mpcollection\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mproducer\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 992\u001b[0;31m         \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'No producer for %s'\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0mid\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    993\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    994\u001b[0m 
     \u001b[0;31m# Inject PBegin input where necessary.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mValueError\u001b[0m: No producer for ref_PCollection_PCollection_265"
+          ]
+        }
+      ],
+      "source": [
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv('/content/drive/MyDrive/apache beam/dataset/nasa/sample_10000.csv', splittable=True)\n",

Review Comment:
   Is this where `source_csv_file` should go?



##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,2163 @@
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "source": [
+        "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License."
+      ],
+      "metadata": {
+        "id": "sARMhsXz8yR1",
+        "cellView": "form"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like\n",
+        "API to declare and define Beam processing pipelines. It provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by invoking only standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "Because we want to explore the elements within a `PCollection`, we use the Interactive runner by installing Apache Beam with the `interactive` component. The latest DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to test our code interactively, building out the pipeline as we go before deploying it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we only work with a subset of the original dataset, since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam function returns a deferred Beam DataFrame, while pandas returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 5,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam import dataframe\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, full], where\n",
+        "# full contains the entire dataset (around 1,000,000 samples).\n",
+        "\n",
+        "source_csv_file = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize the pipeline.\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(source_csv_file, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. A distance of 70,000 km might not seem able to harm us, but at an astronomical scale it is very small and can disrupt many natural phenomena. \n",
+        "\n",
+        "These objects/asteroids can thus prove to be harmful, so it is wise to know what surrounds us and which of those objects can harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types."
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 3,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "f5386993-14cb-42ee-94ca-8ea006860d3e"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 3
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 8,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 378
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "e49b4243-107f-4256-9e09-49cc20bf7f56"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_f6af66571c53daa0d9052370b7d1d8b7\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_f6af66571c53daa0d9052370b7d1d8b7\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_f6af66571c53daa0d9052370b7d1d8b7\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "             spk_id  absolute_magnitude     diameter       albedo  \\\n",
+              "count  9.999000e+03         9999.000000  8688.000000  8672.000000   \n",
+              "mean   2.005000e+06           12.675380    19.245446     0.197723   \n",
+              "std    2.886607e+03            1.639609    30.190191     0.138819   \n",
+              "min    2.000001e+06            3.000000     0.300000     0.008000   \n",
+              "25%    2.002500e+06           11.900000     5.614000     0.074000   \n",
+              "50%    2.005000e+06           12.900000     9.814000     0.187000   \n",
+              "75%    2.007500e+06           13.700000    19.156750     0.283000   \n",
+              "max    2.009999e+06           20.700000   939.400000     1.000000   \n",
+              "\n",
+              "       diameter_sigma  eccentricity  inclination      moid_ld  \\\n",
+              "count     8591.000000   9999.000000  9999.000000  9999.000000   \n",
+              "mean         0.454072      0.148716     7.890742   509.805237   \n",
+              "std          1.093676      0.083803     6.336244   205.046582   \n",
+              "min          0.006000      0.001003     0.042716     0.131028   \n",
+              "25%          0.120000      0.093780     3.220137   377.829197   \n",
+              "50%          0.201000      0.140335     6.018836   470.650523   \n",
+              "75%          0.375000      0.187092    10.918176   636.010802   \n",
+              "max         39.297000      0.889831    68.018875  4241.524913   \n",
+              "\n",
+              "       semi_major_axis_au_unit  \n",
+              "count              9999.000000  \n",
+              "mean                  2.689836  \n",
+              "std                   0.607190  \n",
+              "min                   0.832048  \n",
+              "25%                   2.340816  \n",
+              "50%                   2.614468  \n",
+              "75%                   3.005449  \n",
+              "max                  24.667968  "
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-2c0349a9-81c4-473a-9fa1-44c423244858\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>count</th>\n",
+              "      <td>9.999000e+03</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>8688.000000</td>\n",
+              "      <td>8672.000000</td>\n",
+              "      <td>8591.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>mean</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.675380</td>\n",
+              "      <td>19.245446</td>\n",
+              "      <td>0.197723</td>\n",
+              "      <td>0.454072</td>\n",
+              "      <td>0.148716</td>\n",
+              "      <td>7.890742</td>\n",
+              "      <td>509.805237</td>\n",
+              "      <td>2.689836</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>std</th>\n",
+              "      <td>2.886607e+03</td>\n",
+              "      <td>1.639609</td>\n",
+              "      <td>30.190191</td>\n",
+              "      <td>0.138819</td>\n",
+              "      <td>1.093676</td>\n",
+              "      <td>0.083803</td>\n",
+              "      <td>6.336244</td>\n",
+              "      <td>205.046582</td>\n",
+              "      <td>0.607190</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>min</th>\n",
+              "      <td>2.000001e+06</td>\n",
+              "      <td>3.000000</td>\n",
+              "      <td>0.300000</td>\n",
+              "      <td>0.008000</td>\n",
+              "      <td>0.006000</td>\n",
+              "      <td>0.001003</td>\n",
+              "      <td>0.042716</td>\n",
+              "      <td>0.131028</td>\n",
+              "      <td>0.832048</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>25%</th>\n",
+              "      <td>2.002500e+06</td>\n",
+              "      <td>11.900000</td>\n",
+              "      <td>5.614000</td>\n",
+              "      <td>0.074000</td>\n",
+              "      <td>0.120000</td>\n",
+              "      <td>0.093780</td>\n",
+              "      <td>3.220137</td>\n",
+              "      <td>377.829197</td>\n",
+              "      <td>2.340816</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>50%</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.900000</td>\n",
+              "      <td>9.814000</td>\n",
+              "      <td>0.187000</td>\n",
+              "      <td>0.201000</td>\n",
+              "      <td>0.140335</td>\n",
+              "      <td>6.018836</td>\n",
+              "      <td>470.650523</td>\n",
+              "      <td>2.614468</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>75%</th>\n",
+              "      <td>2.007500e+06</td>\n",
+              "      <td>13.700000</td>\n",
+              "      <td>19.156750</td>\n",
+              "      <td>0.283000</td>\n",
+              "      <td>0.375000</td>\n",
+              "      <td>0.187092</td>\n",
+              "      <td>10.918176</td>\n",
+              "      <td>636.010802</td>\n",
+              "      <td>3.005449</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>max</th>\n",
+              "      <td>2.009999e+06</td>\n",
+              "      <td>20.700000</td>\n",
+              "      <td>939.400000</td>\n",
+              "      <td>1.000000</td>\n",
+              "      <td>39.297000</td>\n",
+              "      <td>0.889831</td>\n",
+              "      <td>68.018875</td>\n",
+              "      <td>4241.524913</td>\n",
+              "      <td>24.667968</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-2c0349a9-81c4-473a-9fa1-44c423244858')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-2c0349a9-81c4-473a-9fa1-44c423244858 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-2c0349a9-81c4-473a-9fa1-44c423244858');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://developers.google.com/machine-learning/data-prep/transform/normalization) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://developers.google.com/machine-learning/data-prep/transform/transform-categorical) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can also use the standard pandas command `DataFrame.describe()` to generate descriptive statistics for the numerical columns, such as percentiles, mean, and standard deviation. "
+      ],
+      "metadata": {
+        "id": "MGAErO0lAYws"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "with dataframe.allow_non_parallel_operations():\n",
+        "  beam_df_description = ib.collect(beam_df.describe())\n",
+        "\n",
+        "beam_df_description"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 378
+        },
+        "id": "Befv697VBGM7",
+        "outputId": "d02b7a41-a8a3-4837-cf63-e1fa9e7b011e"
+      },
+      "execution_count": 9,
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a5b31481d153dff1b7ecdd673624949b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a5b31481d153dff1b7ecdd673624949b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a5b31481d153dff1b7ecdd673624949b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "             spk_id  absolute_magnitude     diameter       albedo  \\\n",
+              "count  9.999000e+03         9999.000000  8688.000000  8672.000000   \n",
+              "mean   2.005000e+06           12.675380    19.245446     0.197723   \n",
+              "std    2.886607e+03            1.639609    30.190191     0.138819   \n",
+              "min    2.000001e+06            3.000000     0.300000     0.008000   \n",
+              "25%    2.002500e+06           11.900000     5.614000     0.074000   \n",
+              "50%    2.005000e+06           12.900000     9.814000     0.187000   \n",
+              "75%    2.007500e+06           13.700000    19.156750     0.283000   \n",
+              "max    2.009999e+06           20.700000   939.400000     1.000000   \n",
+              "\n",
+              "       diameter_sigma  eccentricity  inclination      moid_ld  \\\n",
+              "count     8591.000000   9999.000000  9999.000000  9999.000000   \n",
+              "mean         0.454072      0.148716     7.890742   509.805237   \n",
+              "std          1.093676      0.083803     6.336244   205.046582   \n",
+              "min          0.006000      0.001003     0.042716     0.131028   \n",
+              "25%          0.120000      0.093780     3.220137   377.829197   \n",
+              "50%          0.201000      0.140335     6.018836   470.650523   \n",
+              "75%          0.375000      0.187092    10.918176   636.010802   \n",
+              "max         39.297000      0.889831    68.018875  4241.524913   \n",
+              "\n",
+              "       semi_major_axis_au_unit  \n",
+              "count              9999.000000  \n",
+              "mean                  2.689836  \n",
+              "std                   0.607190  \n",
+              "min                   0.832048  \n",
+              "25%                   2.340816  \n",
+              "50%                   2.614468  \n",
+              "75%                   3.005449  \n",
+              "max                  24.667968  "
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-d16cf806-a3e2-46d9-973d-74448570aaa2\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>count</th>\n",
+              "      <td>9.999000e+03</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>8688.000000</td>\n",
+              "      <td>8672.000000</td>\n",
+              "      <td>8591.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>mean</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.675380</td>\n",
+              "      <td>19.245446</td>\n",
+              "      <td>0.197723</td>\n",
+              "      <td>0.454072</td>\n",
+              "      <td>0.148716</td>\n",
+              "      <td>7.890742</td>\n",
+              "      <td>509.805237</td>\n",
+              "      <td>2.689836</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>std</th>\n",
+              "      <td>2.886607e+03</td>\n",
+              "      <td>1.639609</td>\n",
+              "      <td>30.190191</td>\n",
+              "      <td>0.138819</td>\n",
+              "      <td>1.093676</td>\n",
+              "      <td>0.083803</td>\n",
+              "      <td>6.336244</td>\n",
+              "      <td>205.046582</td>\n",
+              "      <td>0.607190</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>min</th>\n",
+              "      <td>2.000001e+06</td>\n",
+              "      <td>3.000000</td>\n",
+              "      <td>0.300000</td>\n",
+              "      <td>0.008000</td>\n",
+              "      <td>0.006000</td>\n",
+              "      <td>0.001003</td>\n",
+              "      <td>0.042716</td>\n",
+              "      <td>0.131028</td>\n",
+              "      <td>0.832048</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>25%</th>\n",
+              "      <td>2.002500e+06</td>\n",
+              "      <td>11.900000</td>\n",
+              "      <td>5.614000</td>\n",
+              "      <td>0.074000</td>\n",
+              "      <td>0.120000</td>\n",
+              "      <td>0.093780</td>\n",
+              "      <td>3.220137</td>\n",
+              "      <td>377.829197</td>\n",
+              "      <td>2.340816</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>50%</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.900000</td>\n",
+              "      <td>9.814000</td>\n",
+              "      <td>0.187000</td>\n",
+              "      <td>0.201000</td>\n",
+              "      <td>0.140335</td>\n",
+              "      <td>6.018836</td>\n",
+              "      <td>470.650523</td>\n",
+              "      <td>2.614468</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>75%</th>\n",
+              "      <td>2.007500e+06</td>\n",
+              "      <td>13.700000</td>\n",
+              "      <td>19.156750</td>\n",
+              "      <td>0.283000</td>\n",
+              "      <td>0.375000</td>\n",
+              "      <td>0.187092</td>\n",
+              "      <td>10.918176</td>\n",
+              "      <td>636.010802</td>\n",
+              "      <td>3.005449</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>max</th>\n",
+              "      <td>2.009999e+06</td>\n",
+              "      <td>20.700000</td>\n",
+              "      <td>939.400000</td>\n",
+              "      <td>1.000000</td>\n",
+              "      <td>39.297000</td>\n",
+              "      <td>0.889831</td>\n",
+              "      <td>68.018875</td>\n",
+              "      <td>4241.524913</td>\n",
+              "      <td>24.667968</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-d16cf806-a3e2-46d9-973d-74448570aaa2')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-d16cf806-a3e2-46d9-973d-74448570aaa2 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-d16cf806-a3e2-46d9-973d-74448570aaa2');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 9
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before applying any transformations, we need to check whether all the columns are needed for model training. Let's first look at the column descriptions provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the visual magnitude the asteroid would have at a distance of 1 au from both the Sun and the observer, at zero phase angle.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km.\n",
+        "* **albedo:** the fraction of incoming solar radiation that is diffusely reflected, measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in object diameter, in km.\n",
+        "* **eccentricity:** a value between 0 and 1 that describes how much the asteroid's orbit deviates from a perfect circle.\n",
+        "* **inclination:** angle of the orbit with respect to the x-y ecliptic plane.\n",
+        "* **moid_ld:** Earth Minimum Orbit Intersection Distance, in lunar distance (LD) units.\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** half the length of the long axis of the orbit, in au.\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. They carry no information that generalizes across asteroids, so we can remove them before model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_02cba1067de8e024374a26297834d233\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
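+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "The expression above works because `isnull()` returns booleans, and the mean of a boolean column is the fraction of `True` entries. As a minimal sketch, the cell below shows the same computation with plain pandas on a small made-up DataFrame. The name `toy_df` and its values are illustrative assumptions and are not part of the asteroid dataset or the Beam pipeline."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "import numpy as np\n",
+        "import pandas as pd\n",
+        "\n",
+        "# Hypothetical toy data: column 'x' has one missing value out of four.\n",
+        "toy_df = pd.DataFrame({'x': [1.0, np.nan, 3.0, 4.0], 'y': [1.0, 2.0, 3.0, 4.0]})\n",
+        "\n",
+        "# isnull() yields True/False per cell; mean() gives the fraction of True values per column.\n",
+        "toy_missing_pct = toy_df.isnull().mean() * 100\n",
+        "toy_missing_pct"
+      ]
+    },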
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "In the asteroid dataset, most columns have no missing values. However, columns **'diameter'**, **'albedo'** and **'diameter_sigma'** are each missing for roughly 13-14% of the rows. Since these values cannot be measured or derived, we drop these columns; they will not be used for training the machine learning model."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "Next, we need to normalize the numerical columns before using them to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation (a.k.a. [z-score](https://developers.google.com/machine-learning/data-prep/transform/normalization#z-score)). This improves the performance and training stability of the model.\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical columns and the categorical columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 356
+        },
+        "id": "Gjc0UlDD-xUn",
+        "outputId": "cadc4402-7edc-43f6-bc6a-bc4f1fd3314a"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "error",
+          "ename": "NotImplementedError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mNotImplementedError\u001b[0m                       Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/987840581.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Normalizing method_1: Can work but relies on ticket\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frame_base.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    426\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    427\u001b[0m     raise NotImplementedError(\n\u001b[0;32m--> 428\u001b[0;31m         \u001b[0;34mf\"{op!r} is not implemented yet. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    429\u001b[0m         \u001b[0;34mf\"If support for {op!r} is important to you, please let the Beam \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    430\u001b[0m         \u001b[0;34m\"community know by writing to user@beam.apache.org \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mNotImplementedError\u001b[0m: 'loc.setitem' is not implemented yet. If support for 'loc.setitem' is important to you, please let the Beam community know by writing to user@beam.apache.org (see https://beam.apache.org/community/contact-us/) or commenting on https://github.com/apache/beam/issues/20318"
+          ]
+        }
+      ],
+      "source": [
+        "# Normalizing method_1: Can work but relies on ticket #22267\n",
+        "beam_df.loc[:, numerical_cols] = (beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean()) / beam_df.loc[:, numerical_cols].std()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
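+        "Because assignment through `loc` is not implemented yet, we instead normalize a separate DataFrame that contains only the numerical columns."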
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "be40308f-d27c-46fa-e365-43a635addf6b"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>306</th>\n",
+              "      <td>-1.570727</td>\n",
+              "      <td>-0.062543</td>\n",
+              "      <td>-0.278518</td>\n",
+              "      <td>0.373194</td>\n",
+              "      <td>0.357201</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>310</th>\n",
+              "      <td>-1.631718</td>\n",
+              "      <td>-1.724526</td>\n",
+              "      <td>-0.736389</td>\n",
+              "      <td>1.087833</td>\n",
+              "      <td>0.344233</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>546</th>\n",
+              "      <td>-1.753698</td>\n",
+              "      <td>1.028793</td>\n",
+              "      <td>1.415303</td>\n",
+              "      <td>-0.339489</td>\n",
+              "      <td>0.139080</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>635</th>\n",
+              "      <td>-1.875678</td>\n",
+              "      <td>0.244869</td>\n",
+              "      <td>0.005905</td>\n",
+              "      <td>0.214107</td>\n",
+              "      <td>0.367559</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>701</th>\n",
+              "      <td>-3.278451</td>\n",
+              "      <td>-1.570523</td>\n",
+              "      <td>2.006145</td>\n",
+              "      <td>1.542754</td>\n",
+              "      <td>0.829337</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9697</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-1.151809</td>\n",
+              "      <td>-0.082944</td>\n",
+              "      <td>-0.129556</td>\n",
+              "      <td>-0.533538</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9813</th>\n",
+              "      <td>1.722740</td>\n",
+              "      <td>0.844551</td>\n",
+              "      <td>-0.583247</td>\n",
+              "      <td>-1.006447</td>\n",
+              "      <td>-0.677961</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9868</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-0.207399</td>\n",
+              "      <td>-0.784665</td>\n",
+              "      <td>-0.462136</td>\n",
+              "      <td>-0.539794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9903</th>\n",
+              "      <td>0.868878</td>\n",
+              "      <td>0.460086</td>\n",
+              "      <td>0.092258</td>\n",
+              "      <td>-0.107597</td>\n",
+              "      <td>0.071794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9956</th>\n",
+              "      <td>0.746898</td>\n",
+              "      <td>-0.234132</td>\n",
+              "      <td>-0.161116</td>\n",
+              "      <td>-0.601379</td>\n",
+              "      <td>-0.664887</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 5 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_numerical_cols = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize the DataFrame that contains only the numerical columns\n",
+        "beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean())/beam_numerical_cols.std()\n",
+        "\n",
+        "ib.collect(beam_numerical_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to convert the categorical columns into one-hot encoded variables so that they can be used during training.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "Ngoxg0rSywVd",
+        "outputId": "d81bb29a-f8f8-4186-b186-5ff85667dbec"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1671644751.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'object_class'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'get_dummies'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col = beam_df.filter(items=['object_class'])\n",
+        "object_class_col.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "hz8s7z8caTq-",
+        "outputId": "5c543a6b-0ea1-41f8-afec-9691bbbd1f5b"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1927971370.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;31m# df['categories_concat'].str.get_dummies('-')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'str'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col.str.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "rVdSIyCB0spw"
+      },
+      "source": [
+        "# Putting it all together\n",
+        "\n",
+        "Let's now try to summarize all the steps that we've executed above into a full pipeline implementation and visualize our pre-processed data.\n",
+        "\n",
+        "> ℹ️ Note that the only standard Beam construct used here is the `pipeline` instance. The remaining pre-processing commands are all native pandas methods that have been integrated with the Beam DataFrame API."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "ndaSNond0v8Q",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 498
+        },
+        "outputId": "0155d359-45c9-4345-e1b6-b1881408f049"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "error",
+          "ename": "ValueError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1408061827.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m     25\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     26\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 27\u001b[0;31m \u001b[0mib\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcollect\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpreprocessed_dataset\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/utils.py\u001b[0m in \u001b[0;36mrun_within_progress_indicator\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    275\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mrun_within_progress_indicator\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    276\u001b[0m     \u001b[0;32mwith\u001b[0m \u001b[0mProgressIndicator\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34mf'Processing... {func.__name__}'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Done.'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 277\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    278\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    279\u001b[0m   \u001b[0;32mreturn\u001b[0m \u001b[0mrun_within_progress_indicator\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/interactive_beam.py\u001b[0m in \u001b[0;36mcollect\u001b[0;34m(pcoll, n, duration, include_window_info)\u001b[0m\n\u001b[1;32m    945\u001b[0m         element_type=element_type)\n\u001b[1;32m    946\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 947\u001b[0;31m   \u001b[0mrecording\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mrecording_manager\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrecord\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mpcoll\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmax_n\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mn\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmax_duration\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mduration\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    948\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    949\u001b[0m   \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/recording_manager.py\u001b[0m in \u001b[0;36mrecord\u001b[0;34m(self, pcolls, max_n, max_duration)\u001b[0m\n\u001b[1;32m    459\u001b[0m       pf.PipelineFragment(\n\u001b[1;32m    460\u001b[0m           \u001b[0mlist\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0muncomputed_pcolls\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 461\u001b[0;31m           self.user_pipeline.options).run(blocking=is_remote_run)\n\u001b[0m\u001b[1;32m    462\u001b[0m       \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mie\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcurrent_env\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpipeline_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0muser_pipeline\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n
 \u001b[1;32m    463\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py\u001b[0m in \u001b[0;36mrun\u001b[0;34m(self, display_pipeline_graph, use_cache, blocking)\u001b[0m\n\u001b[1;32m    111\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_force_compute\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0muse_cache\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    112\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_blocking\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mblocking\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 113\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdeduce_fragment\u001
 b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    114\u001b[0m     \u001b[0;32mfinally\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    115\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_skip_display\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpreserved_skip_display\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py\u001b[0m in \u001b[0;36mdeduce_fragment\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m     98\u001b[0m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mto_runner_api\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     99\u001b[0m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 100\u001b[0;31m         self._options)\n\u001b[0m\u001b[1;32m    101\u001b[0m     \u001b[0mie\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcurrent_env\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0madd_derived_pipeline\u001b[0m\u001b[0;34m(\u001b[0m\u001b[
 0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfragment\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    102\u001b[0m     \u001b[0;32mreturn\u001b[0m \u001b[0mfragment\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/pipeline.py\u001b[0m in \u001b[0;36mfrom_runner_api\u001b[0;34m(proto, runner, options, return_context)\u001b[0m\n\u001b[1;32m    990\u001b[0m       \u001b[0mpcollection\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpipeline\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mp\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    991\u001b[0m       \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mpcollection\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mproducer\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 992\u001b[0;31m         \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'No producer for %s'\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0mid\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    993\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    994\u001b[0m 
     \u001b[0;31m# Inject PBegin input where necessary.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mValueError\u001b[0m: No producer for ref_PCollection_PCollection_265"
+          ]
+        }
+      ],
+      "source": [
+        "# Specify the location of source csv file to be processed\n",
+        "source_csv_file = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv('/content/drive/MyDrive/apache beam/dataset/nasa/sample_10000.csv', splittable=True)\n",
+        "\n",
+        "# Drop irrelevant columns/columns with missing values\n",
+        "beam_df = beam_df.drop(['spk_id', 'full_name','diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "\n",
+        "# Get numerical columns/columns with categorical variables\n",
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))\n",
+        "\n",
+        "# Normalize the numerical variables \n",
+        "beam_df_numerical = beam_df.filter(items=numerical_cols)\n",
+        "beam_df_numerical = (beam_df_numerical - beam_df_numerical.mean())/beam_df_numerical.std()\n",
+        "\n",
+        "\n",
+        "# One-hot encode the categorical variables \n",
+        "beam_df_categorical = beam_df.filter(items=categorical_cols)\n",
+        "# TODO: one-hot encoding step\n",
+        "\n",
+        "# Merge the normalized variables with the one-hot encoded variables\n",
+        "preprocessed_dataset = beam_df_categorical.merge(beam_df_numerical, left_index = True, right_index = True)\n",
+        "\n",
+        "ib.collect(preprocessed_dataset)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xZvJTqa3XKI_"
+      },
+      "source": [
+        "# Part II : Process the full dataset with the Distributed Runner\n",
+        "Now that we've showcased how to build and execute the pipeline locally using the Interactive Runner, it's time to execute our pipeline on our full dataset by switching to a distributed runner. For this example, we will execute our pipeline on [Dataflow](https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline)."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "PROJECT_ID = \"<my-gcp-project>\"\n",
+        "REGION = \"us-west1\"\n",

Review Comment:
   Nit: can we default to `us-central1`? Its Carbon Free Energy percentage is higher and might have a more average latency for the east coast.
   
   https://cloud.google.com/sustainability/region-carbon
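
   As a rough sketch of what the suggested default could look like, the region can be kept in one place and turned into pipeline option flags. The helper name `dataflow_flags`, the `DEFAULT_REGION` constant, and the bucket placeholder are hypothetical and not part of the notebook; the flag names (`--runner`, `--project`, `--region`, `--temp_location`) are standard Beam pipeline options.

```python
# Hypothetical helper, not taken from the notebook under review.
# It only builds the flag list; no Beam import is needed to run it.
DEFAULT_REGION = "us-central1"  # suggested default from the review comment


def dataflow_flags(project_id, temp_location, region=DEFAULT_REGION):
    """Return command-line style flags for a Dataflow run."""
    return [
        "--runner=DataflowRunner",
        f"--project={project_id}",
        f"--region={region}",
        f"--temp_location={temp_location}",
    ]


# Placeholder values; replace with a real project and bucket.
flags = dataflow_flags("<my-gcp-project>", "gs://<my-bucket>/tmp")
print(flags)
```

   In the notebook, a list like this could then be passed along the lines of `PipelineOptions(flags=flags)`, and users who prefer another region can still override the default.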



##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,2163 @@
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "source": [
+        "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License."
+      ],
+      "metadata": {
+        "id": "sARMhsXz8yR1",
+        "cellView": "form"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like\n",
+        "API to declare and define Beam processing pipelines. They provide a familiar interface for machine learning practitioners to build complex data-processing pipelines by only invoking standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "As we want to explore the elements within a `PCollection`, we can make use of the Interactive runner by installing Apache Beam with the `interactive` component. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to test our code interactively, building out the pipeline as we go before deploying it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we will only be working with a subset of the original dataset since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam method returns a deferred Beam DataFrame while pandas returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 5,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam import dataframe\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, full] where\n",
+        "# full contains all of the dataset (around 1000000 samples)\n",
+        "\n",
+        "source_csv_file = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(source_csv_file, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. We might think that a distance of 70,000 km cannot harm us, but at an astronomical scale this is a very small distance, and an object this close can disrupt many natural phenomena. \n",
+        "\n",
+        "These objects/asteroids can thus prove to be harmful. Hence, it is wise to know what surrounds us and which of those objects can harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types."
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 3,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "f5386993-14cb-42ee-94ca-8ea006860d3e"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 3
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 8,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 378
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "e49b4243-107f-4256-9e09-49cc20bf7f56"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_f6af66571c53daa0d9052370b7d1d8b7\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_f6af66571c53daa0d9052370b7d1d8b7\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_f6af66571c53daa0d9052370b7d1d8b7\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "             spk_id  absolute_magnitude     diameter       albedo  \\\n",
+              "count  9.999000e+03         9999.000000  8688.000000  8672.000000   \n",
+              "mean   2.005000e+06           12.675380    19.245446     0.197723   \n",
+              "std    2.886607e+03            1.639609    30.190191     0.138819   \n",
+              "min    2.000001e+06            3.000000     0.300000     0.008000   \n",
+              "25%    2.002500e+06           11.900000     5.614000     0.074000   \n",
+              "50%    2.005000e+06           12.900000     9.814000     0.187000   \n",
+              "75%    2.007500e+06           13.700000    19.156750     0.283000   \n",
+              "max    2.009999e+06           20.700000   939.400000     1.000000   \n",
+              "\n",
+              "       diameter_sigma  eccentricity  inclination      moid_ld  \\\n",
+              "count     8591.000000   9999.000000  9999.000000  9999.000000   \n",
+              "mean         0.454072      0.148716     7.890742   509.805237   \n",
+              "std          1.093676      0.083803     6.336244   205.046582   \n",
+              "min          0.006000      0.001003     0.042716     0.131028   \n",
+              "25%          0.120000      0.093780     3.220137   377.829197   \n",
+              "50%          0.201000      0.140335     6.018836   470.650523   \n",
+              "75%          0.375000      0.187092    10.918176   636.010802   \n",
+              "max         39.297000      0.889831    68.018875  4241.524913   \n",
+              "\n",
+              "       semi_major_axis_au_unit  \n",
+              "count              9999.000000  \n",
+              "mean                  2.689836  \n",
+              "std                   0.607190  \n",
+              "min                   0.832048  \n",
+              "25%                   2.340816  \n",
+              "50%                   2.614468  \n",
+              "75%                   3.005449  \n",
+              "max                  24.667968  "
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-2c0349a9-81c4-473a-9fa1-44c423244858\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>count</th>\n",
+              "      <td>9.999000e+03</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>8688.000000</td>\n",
+              "      <td>8672.000000</td>\n",
+              "      <td>8591.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>mean</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.675380</td>\n",
+              "      <td>19.245446</td>\n",
+              "      <td>0.197723</td>\n",
+              "      <td>0.454072</td>\n",
+              "      <td>0.148716</td>\n",
+              "      <td>7.890742</td>\n",
+              "      <td>509.805237</td>\n",
+              "      <td>2.689836</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>std</th>\n",
+              "      <td>2.886607e+03</td>\n",
+              "      <td>1.639609</td>\n",
+              "      <td>30.190191</td>\n",
+              "      <td>0.138819</td>\n",
+              "      <td>1.093676</td>\n",
+              "      <td>0.083803</td>\n",
+              "      <td>6.336244</td>\n",
+              "      <td>205.046582</td>\n",
+              "      <td>0.607190</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>min</th>\n",
+              "      <td>2.000001e+06</td>\n",
+              "      <td>3.000000</td>\n",
+              "      <td>0.300000</td>\n",
+              "      <td>0.008000</td>\n",
+              "      <td>0.006000</td>\n",
+              "      <td>0.001003</td>\n",
+              "      <td>0.042716</td>\n",
+              "      <td>0.131028</td>\n",
+              "      <td>0.832048</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>25%</th>\n",
+              "      <td>2.002500e+06</td>\n",
+              "      <td>11.900000</td>\n",
+              "      <td>5.614000</td>\n",
+              "      <td>0.074000</td>\n",
+              "      <td>0.120000</td>\n",
+              "      <td>0.093780</td>\n",
+              "      <td>3.220137</td>\n",
+              "      <td>377.829197</td>\n",
+              "      <td>2.340816</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>50%</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.900000</td>\n",
+              "      <td>9.814000</td>\n",
+              "      <td>0.187000</td>\n",
+              "      <td>0.201000</td>\n",
+              "      <td>0.140335</td>\n",
+              "      <td>6.018836</td>\n",
+              "      <td>470.650523</td>\n",
+              "      <td>2.614468</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>75%</th>\n",
+              "      <td>2.007500e+06</td>\n",
+              "      <td>13.700000</td>\n",
+              "      <td>19.156750</td>\n",
+              "      <td>0.283000</td>\n",
+              "      <td>0.375000</td>\n",
+              "      <td>0.187092</td>\n",
+              "      <td>10.918176</td>\n",
+              "      <td>636.010802</td>\n",
+              "      <td>3.005449</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>max</th>\n",
+              "      <td>2.009999e+06</td>\n",
+              "      <td>20.700000</td>\n",
+              "      <td>939.400000</td>\n",
+              "      <td>1.000000</td>\n",
+              "      <td>39.297000</td>\n",
+              "      <td>0.889831</td>\n",
+              "      <td>68.018875</td>\n",
+              "      <td>4241.524913</td>\n",
+              "      <td>24.667968</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-2c0349a9-81c4-473a-9fa1-44c423244858')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-2c0349a9-81c4-473a-9fa1-44c423244858 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-2c0349a9-81c4-473a-9fa1-44c423244858');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://developers.google.com/machine-learning/data-prep/transform/normalization) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://developers.google.com/machine-learning/data-prep/transform/transform-categorical) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can also use the standard pandas command `DataFrame.describe()` to generate descriptive statistics for the numerical columns, such as percentiles, mean, and standard deviation."
+      ],
+      "metadata": {
+        "id": "MGAErO0lAYws"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "with dataframe.allow_non_parallel_operations():\n",
+        "  beam_df_description = ib.collect(beam_df.describe())\n",
+        "\n",
+        "beam_df_description"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 378
+        },
+        "id": "Befv697VBGM7",
+        "outputId": "d02b7a41-a8a3-4837-cf63-e1fa9e7b011e"
+      },
+      "execution_count": 9,
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a5b31481d153dff1b7ecdd673624949b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a5b31481d153dff1b7ecdd673624949b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a5b31481d153dff1b7ecdd673624949b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "             spk_id  absolute_magnitude     diameter       albedo  \\\n",
+              "count  9.999000e+03         9999.000000  8688.000000  8672.000000   \n",
+              "mean   2.005000e+06           12.675380    19.245446     0.197723   \n",
+              "std    2.886607e+03            1.639609    30.190191     0.138819   \n",
+              "min    2.000001e+06            3.000000     0.300000     0.008000   \n",
+              "25%    2.002500e+06           11.900000     5.614000     0.074000   \n",
+              "50%    2.005000e+06           12.900000     9.814000     0.187000   \n",
+              "75%    2.007500e+06           13.700000    19.156750     0.283000   \n",
+              "max    2.009999e+06           20.700000   939.400000     1.000000   \n",
+              "\n",
+              "       diameter_sigma  eccentricity  inclination      moid_ld  \\\n",
+              "count     8591.000000   9999.000000  9999.000000  9999.000000   \n",
+              "mean         0.454072      0.148716     7.890742   509.805237   \n",
+              "std          1.093676      0.083803     6.336244   205.046582   \n",
+              "min          0.006000      0.001003     0.042716     0.131028   \n",
+              "25%          0.120000      0.093780     3.220137   377.829197   \n",
+              "50%          0.201000      0.140335     6.018836   470.650523   \n",
+              "75%          0.375000      0.187092    10.918176   636.010802   \n",
+              "max         39.297000      0.889831    68.018875  4241.524913   \n",
+              "\n",
+              "       semi_major_axis_au_unit  \n",
+              "count              9999.000000  \n",
+              "mean                  2.689836  \n",
+              "std                   0.607190  \n",
+              "min                   0.832048  \n",
+              "25%                   2.340816  \n",
+              "50%                   2.614468  \n",
+              "75%                   3.005449  \n",
+              "max                  24.667968  "
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-d16cf806-a3e2-46d9-973d-74448570aaa2\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>count</th>\n",
+              "      <td>9.999000e+03</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>8688.000000</td>\n",
+              "      <td>8672.000000</td>\n",
+              "      <td>8591.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>mean</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.675380</td>\n",
+              "      <td>19.245446</td>\n",
+              "      <td>0.197723</td>\n",
+              "      <td>0.454072</td>\n",
+              "      <td>0.148716</td>\n",
+              "      <td>7.890742</td>\n",
+              "      <td>509.805237</td>\n",
+              "      <td>2.689836</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>std</th>\n",
+              "      <td>2.886607e+03</td>\n",
+              "      <td>1.639609</td>\n",
+              "      <td>30.190191</td>\n",
+              "      <td>0.138819</td>\n",
+              "      <td>1.093676</td>\n",
+              "      <td>0.083803</td>\n",
+              "      <td>6.336244</td>\n",
+              "      <td>205.046582</td>\n",
+              "      <td>0.607190</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>min</th>\n",
+              "      <td>2.000001e+06</td>\n",
+              "      <td>3.000000</td>\n",
+              "      <td>0.300000</td>\n",
+              "      <td>0.008000</td>\n",
+              "      <td>0.006000</td>\n",
+              "      <td>0.001003</td>\n",
+              "      <td>0.042716</td>\n",
+              "      <td>0.131028</td>\n",
+              "      <td>0.832048</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>25%</th>\n",
+              "      <td>2.002500e+06</td>\n",
+              "      <td>11.900000</td>\n",
+              "      <td>5.614000</td>\n",
+              "      <td>0.074000</td>\n",
+              "      <td>0.120000</td>\n",
+              "      <td>0.093780</td>\n",
+              "      <td>3.220137</td>\n",
+              "      <td>377.829197</td>\n",
+              "      <td>2.340816</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>50%</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.900000</td>\n",
+              "      <td>9.814000</td>\n",
+              "      <td>0.187000</td>\n",
+              "      <td>0.201000</td>\n",
+              "      <td>0.140335</td>\n",
+              "      <td>6.018836</td>\n",
+              "      <td>470.650523</td>\n",
+              "      <td>2.614468</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>75%</th>\n",
+              "      <td>2.007500e+06</td>\n",
+              "      <td>13.700000</td>\n",
+              "      <td>19.156750</td>\n",
+              "      <td>0.283000</td>\n",
+              "      <td>0.375000</td>\n",
+              "      <td>0.187092</td>\n",
+              "      <td>10.918176</td>\n",
+              "      <td>636.010802</td>\n",
+              "      <td>3.005449</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>max</th>\n",
+              "      <td>2.009999e+06</td>\n",
+              "      <td>20.700000</td>\n",
+              "      <td>939.400000</td>\n",
+              "      <td>1.000000</td>\n",
+              "      <td>39.297000</td>\n",
+              "      <td>0.889831</td>\n",
+              "      <td>68.018875</td>\n",
+              "      <td>4241.524913</td>\n",
+              "      <td>24.667968</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-d16cf806-a3e2-46d9-973d-74448570aaa2')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-d16cf806-a3e2-46d9-973d-74448570aaa2 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-d16cf806-a3e2-46d9-973d-74448570aaa2');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 9
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check whether all the columns are needed for model training. Let's first look at the column descriptions provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the apparent magnitude the object would have if it were located 1 au from both the Sun and the observer, at zero phase angle.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km.\n",
+        "* **albedo:** the fraction of incoming solar radiation that the object reflects diffusely, measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in object diameter, in km.\n",
+        "* **eccentricity:** value between 0 and 1 that describes how much the asteroid's orbit deviates from a circle (0 is a perfect circle).\n",
+        "* **inclination:** angle of the orbit with respect to the x-y ecliptic plane.\n",
+        "* **moid_ld:** Earth Minimum Orbit Intersection Distance, in lunar distance (LD) units.\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** the length of half of the long axis of the orbit, in au.\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. These columns can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_02cba1067de8e024374a26297834d233\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most of the columns have no missing values. However, roughly 13 to 14% of the values in columns **'diameter'**, **'albedo'** and **'diameter_sigma'** are missing. Since these values cannot be measured or derived, we remove these columns; they will not be used for training the machine learning model."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "Next, we need to normalize the numerical columns before using them to train a model. A common standardization method is to subtract the mean and divide by the standard deviation (a.k.a. [z-score](https://developers.google.com/machine-learning/data-prep/transform/normalization#z-score)). This generally improves model performance and training stability.\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical columns and the categorical columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 356
+        },
+        "id": "Gjc0UlDD-xUn",
+        "outputId": "cadc4402-7edc-43f6-bc6a-bc4f1fd3314a"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "error",
+          "ename": "NotImplementedError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mNotImplementedError\u001b[0m                       Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/987840581.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Normalizing method_1: Can work but relies on ticket\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001
 b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frame_base.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    426\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    427\u001b[0m     raise NotImplementedError(\n\u001b[0;32m--> 428\u001b[0;31m         \u001b[0;34mf\"{op!r} is not implemented yet. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    429\u001b[0m         \u001b[0;34mf\"If support for {op!r} is important to you, please let the Beam \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    430\u001b[0m         \u001b[0;34m\"community know by writing to user@beam.apache.org \"\u001b[0m\u001b[0;34m\u
 001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mNotImplementedError\u001b[0m: 'loc.setitem' is not implemented yet. If support for 'loc.setitem' is important to you, please let the Beam community know by writing to user@beam.apache.org (see https://beam.apache.org/community/contact-us/) or commenting on https://github.com/apache/beam/issues/20318"
+          ]
+        }
+      ],
+      "source": [
+        "# Normalizing method_1: Can work but relies on ticket #22267\n",
+        "beam_df.loc[:, numerical_cols] = (beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean()) / beam_df.loc[:, numerical_cols].std()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
+        "Normalizing the data: select the numerical columns into a separate dataframe and standardize it."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "be40308f-d27c-46fa-e365-43a635addf6b"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>306</th>\n",
+              "      <td>-1.570727</td>\n",
+              "      <td>-0.062543</td>\n",
+              "      <td>-0.278518</td>\n",
+              "      <td>0.373194</td>\n",
+              "      <td>0.357201</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>310</th>\n",
+              "      <td>-1.631718</td>\n",
+              "      <td>-1.724526</td>\n",
+              "      <td>-0.736389</td>\n",
+              "      <td>1.087833</td>\n",
+              "      <td>0.344233</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>546</th>\n",
+              "      <td>-1.753698</td>\n",
+              "      <td>1.028793</td>\n",
+              "      <td>1.415303</td>\n",
+              "      <td>-0.339489</td>\n",
+              "      <td>0.139080</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>635</th>\n",
+              "      <td>-1.875678</td>\n",
+              "      <td>0.244869</td>\n",
+              "      <td>0.005905</td>\n",
+              "      <td>0.214107</td>\n",
+              "      <td>0.367559</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>701</th>\n",
+              "      <td>-3.278451</td>\n",
+              "      <td>-1.570523</td>\n",
+              "      <td>2.006145</td>\n",
+              "      <td>1.542754</td>\n",
+              "      <td>0.829337</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9697</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-1.151809</td>\n",
+              "      <td>-0.082944</td>\n",
+              "      <td>-0.129556</td>\n",
+              "      <td>-0.533538</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9813</th>\n",
+              "      <td>1.722740</td>\n",
+              "      <td>0.844551</td>\n",
+              "      <td>-0.583247</td>\n",
+              "      <td>-1.006447</td>\n",
+              "      <td>-0.677961</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9868</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-0.207399</td>\n",
+              "      <td>-0.784665</td>\n",
+              "      <td>-0.462136</td>\n",
+              "      <td>-0.539794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9903</th>\n",
+              "      <td>0.868878</td>\n",
+              "      <td>0.460086</td>\n",
+              "      <td>0.092258</td>\n",
+              "      <td>-0.107597</td>\n",
+              "      <td>0.071794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9956</th>\n",
+              "      <td>0.746898</td>\n",
+              "      <td>-0.234132</td>\n",
+              "      <td>-0.161116</td>\n",
+              "      <td>-0.601379</td>\n",
+              "      <td>-0.664887</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 5 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_numerical_cols = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize the dataframe that contains only the numerical columns\n",
+        "beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean())/beam_numerical_cols.std()\n",
+        "\n",
+        "ib.collect(beam_numerical_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to convert the categorical columns into one-hot encoded variables to use them during training.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "Ngoxg0rSywVd",
+        "outputId": "d81bb29a-f8f8-4186-b186-5ff85667dbec"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1671644751.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'object_class'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'get_dummies'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col= beam_df.filter(items=['object_class'])\n",
+        "object_class_col.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "hz8s7z8caTq-",
+        "outputId": "5c543a6b-0ea1-41f8-afec-9691bbbd1f5b"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1927971370.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;31m# df['categories_concat'].str.get_dummies('-')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'str'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col.str.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "rVdSIyCB0spw"
+      },
+      "source": [
+        "# Putting it all together\n",
+        "\n",
+        "Let's now combine all the steps that we've executed above into a full pipeline implementation and visualize our pre-processed data.\n",
+        "\n",
+        "> ℹ️ Note that the only standard Beam construct invoked here is the `Pipeline` instance. The rest of the pre-processing commands are all based on native pandas methods that have been integrated with the Beam DataFrame API. "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "ndaSNond0v8Q",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 498
+        },
+        "outputId": "0155d359-45c9-4345-e1b6-b1881408f049"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "error",
+          "ename": "ValueError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1408061827.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m     25\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     26\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 27\u001b[0;31m \u001b[0mib\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcollect\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpreprocessed_dataset\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/utils.py\u001b[0m in \u001b[0;36mrun_within_progress_indicator\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    275\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mrun_within_progress_indicator\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    276\u001b[0m     \u001b[0;32mwith\u001b[0m \u001b[0mProgressIndicator\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34mf'Processing... {func.__name__}'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Done.'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 277\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001
 b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    278\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    279\u001b[0m   \u001b[0;32mreturn\u001b[0m \u001b[0mrun_within_progress_indicator\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/interactive_beam.py\u001b[0m in \u001b[0;36mcollect\u001b[0;34m(pcoll, n, duration, include_window_info)\u001b[0m\n\u001b[1;32m    945\u001b[0m         element_type=element_type)\n\u001b[1;32m    946\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 947\u001b[0;31m   \u001b[0mrecording\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mrecording_manager\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrecord\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mpcoll\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmax_n\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mn\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmax_duration\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mduration\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    948\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    949\u001b[0m   \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0
 ;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/recording_manager.py\u001b[0m in \u001b[0;36mrecord\u001b[0;34m(self, pcolls, max_n, max_duration)\u001b[0m\n\u001b[1;32m    459\u001b[0m       pf.PipelineFragment(\n\u001b[1;32m    460\u001b[0m           \u001b[0mlist\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0muncomputed_pcolls\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 461\u001b[0;31m           self.user_pipeline.options).run(blocking=is_remote_run)\n\u001b[0m\u001b[1;32m    462\u001b[0m       \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mie\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcurrent_env\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpipeline_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0muser_pipeline\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n
 \u001b[1;32m    463\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py\u001b[0m in \u001b[0;36mrun\u001b[0;34m(self, display_pipeline_graph, use_cache, blocking)\u001b[0m\n\u001b[1;32m    111\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_force_compute\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0muse_cache\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    112\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_blocking\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mblocking\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 113\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdeduce_fragment\u001
 b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    114\u001b[0m     \u001b[0;32mfinally\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    115\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_skip_display\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpreserved_skip_display\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py\u001b[0m in \u001b[0;36mdeduce_fragment\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m     98\u001b[0m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mto_runner_api\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     99\u001b[0m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 100\u001b[0;31m         self._options)\n\u001b[0m\u001b[1;32m    101\u001b[0m     \u001b[0mie\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcurrent_env\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0madd_derived_pipeline\u001b[0m\u001b[0;34m(\u001b[0m\u001b[
 0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfragment\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    102\u001b[0m     \u001b[0;32mreturn\u001b[0m \u001b[0mfragment\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/pipeline.py\u001b[0m in \u001b[0;36mfrom_runner_api\u001b[0;34m(proto, runner, options, return_context)\u001b[0m\n\u001b[1;32m    990\u001b[0m       \u001b[0mpcollection\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpipeline\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mp\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    991\u001b[0m       \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mpcollection\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mproducer\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 992\u001b[0;31m         \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'No producer for %s'\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0mid\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    993\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    994\u001b[0m 
     \u001b[0;31m# Inject PBegin input where necessary.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mValueError\u001b[0m: No producer for ref_PCollection_PCollection_265"
+          ]
+        }
+      ],
+      "source": [
+        "# Specify the location of source csv file to be processed\n",
+        "source_csv_file = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(source_csv_file, splittable=True)\n",
+        "\n",
+        "# Drop irrelevant columns/columns with missing values\n",
+        "beam_df = beam_df.drop(['spk_id', 'full_name','diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "\n",
+        "# Get numerical columns/columns with categorical variables\n",
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))\n",
+        "\n",
+        "# Normalize the numerical variables \n",
+        "beam_df_numerical = beam_df.filter(items=numerical_cols)\n",
+        "beam_df_numerical = (beam_df_numerical - beam_df_numerical.mean())/beam_df_numerical.std()\n",
+        "\n",
+        "\n",
+        "# One-hot encode the categorical variables \n",
+        "beam_df_categorical = beam_df.filter(items=categorical_cols)\n",
+        "# ToDo: one hot-encoding step\n",
+        "\n",
+        "# Merge the normalized variables with the one-hot encoded variables\n",
+        "preprocessed_dataset = beam_df_categorical.merge(beam_df_numerical, left_index = True, right_index = True)\n",
+        "\n",
+        "ib.collect(preprocessed_dataset)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xZvJTqa3XKI_"
+      },
+      "source": [
+        "# Part II : Process the full dataset with the Distributed Runner\n",
+        "Now that we've showcased how to build and execute the pipeline locally using the Interactive Runner, it's time to execute our pipeline on our full dataset by switching to a distributed runner. For this example, we will execute our pipeline on [Dataflow](https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline)."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "PROJECT_ID = \"<my-gcp-project>\"\n",
+        "REGION = \"us-west1\"\n",
+        "TEMP_DIR = \"gs://<my-bucket>/tmp\"\n",
+        "OUTPUT_DIR = \"gs://<my-bucket>/dataframe-result\""
+      ],
+      "metadata": {
+        "id": "dDBYbMEWbL4t"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "> ℹ️ Note that we are now processing the full dataset `full.csv`, which contains approximately 1 million rows. We're also writing the results to a `csv` file instead of using `ib.collect()` to materialize the deferred DataFrame.\n",
+        "\n",
+        "> ℹ️ The only thing we need to change to switch from an interactive runner to a distributed one is the pipeline options. The rest of the pipeline steps are identical."
+      ],
+      "metadata": {
+        "id": "Qk1GaYoSc9-1"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Specify the location of source csv file to be processed (full dataset)\n",
+        "source_csv_file = 'gs://apache-beam-samples/nasa_jpl_asteroid/full.csv'\n",
+        "\n",
+        "# Build a new pipeline that will execute on Dataflow.\n",
+        "p = beam.Pipeline(DataflowRunner(),\n",
+        "                  options=beam.options.pipeline_options.PipelineOptions(\n",
+        "                      project=PROJECT_ID,\n",
+        "                      region=REGION,\n",
+        "                      temp_location=TEMP_DIR,\n",
+        "                      # Disable autoscaling for a quicker demo\n",
+        "                      autoscaling_algorithm='NONE',\n",
+        "                      num_workers=10))\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(source_csv_file, splittable=True)\n",
+        "\n",
+        "# Drop irrelevant columns/columns with missing values\n",
+        "beam_df = beam_df.drop(['spk_id', 'full_name','diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "\n",
+        "# Get numerical columns/columns with categorical variables\n",
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))\n",
+        "\n",
+        "# Normalize the numerical variables \n",
+        "beam_df_numerical = beam_df.filter(items=numerical_cols)\n",
+        "beam_df_numerical = (beam_df_numerical - beam_df_numerical.mean())/beam_df_numerical.std()\n",
+        "\n",
+        "# One-hot encode the categorical variables \n",
+        "beam_df_categorical = beam_df.filter(items=categorical_cols)\n",
+        "# Todo: one hot-encoding step\n",
+        "\n",
+        "# Merge the normalized variables with the one-hot encoded variables  (Optional)\n",
+        "preprocessed_dataset = beam_df_categorical.merge(beam_df_numerical, left_index = True, right_index = True)\n",
+        "\n",
+        "# Write the pre-processed dataset to csv\n",
+        "preprocessed_dataset.to_csv(os.path.join(OUTPUT_DIR, \"preprocessed_data.csv\"))"
+      ],
+      "metadata": {
+        "id": "1XovR0gKbMlK"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Let's now submit and execute our pipeline."
+      ],
+      "metadata": {
+        "id": "a789u4Yecs_g"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "p.run().wait_until_finish()"
+      ],
+      "metadata": {
+        "id": "pbUlC102bPaZ"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "The pipeline job will take some time to finish."
+      ],
+      "metadata": {
+        "id": "dzdqmzKzTOng"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# What's next \n",
+        "\n",
+        "Now that we've seen how to analyze and preprocess a large-scale dataset with the Beam DataFrames API, we can train a classification model on our preprocessed dataset.  \n",
+        "\n",
+        "To learn more on how to get started with classifying structured data, refer to:\n",
+        "\n",
+        "*   [Classify structured data with feature columns](https://www.tensorflow.org/tutorials/structured_data/feature_columns)\n",

Review Comment:
   Feature columns are no longer recommended. They were used for tf.Estimators, but now that TF2 has moved to Keras, the recommendation is to use the preprocessing layers instead.
   
   The preprocessing layers already take care of z-score normalization and one-hot encoding, but it's still nice to see how to do that with Beam.
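
For reference, here is a minimal sketch of the two transformations mentioned above, written in plain Python rather than with Keras or Beam. The helper names `z_score` and `one_hot` and the input values are hypothetical and only for illustration. The sketch uses the sample standard deviation, which matches the pandas default; a preprocessing layer may use a different convention.

```python
from statistics import mean, stdev


def z_score(values):
    """Scale values to zero mean and unit (sample) standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation, like pandas' default std()
    return [(v - mu) / sigma for v in values]


def one_hot(labels):
    """Return (vocabulary, rows) with one 0/1 indicator column per distinct label."""
    vocabulary = sorted(set(labels))
    index = {label: i for i, label in enumerate(vocabulary)}
    rows = [
        [1 if index[label] == i else 0 for i in range(len(vocabulary))]
        for label in labels
    ]
    return vocabulary, rows


# Hypothetical inputs, not taken from the asteroid dataset.
scaled = z_score([1.0, 2.0, 3.0])
vocabulary, encoded = one_hot(["APO", "AMO", "APO"])
print(scaled)
print(vocabulary, encoded)
```

The one-hot step needs the full set of distinct labels before any row can be encoded, which is probably why it is harder to express on a deferred DataFrame than the normalization step.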





[GitHub] [beam] PhilippeMoussalli commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
PhilippeMoussalli commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r1005710426


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,2163 @@
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "source": [
+        "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License."
+      ],
+      "metadata": {
+        "id": "sARMhsXz8yR1",
+        "cellView": "form"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like\n",
+        "API to declare and define Beam processing pipelines. They provide a familiar interface for machine learning practitioners to build complex data-processing pipelines by only invoking standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "As we want to explore the elements within a `PCollection`, we can make use of the Interactive runner by installing Apache Beam with the `interactive` component. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to test our code interactively, building out the pipeline as we go before deploying it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we will only be working with a subset of the original dataset since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam method returns a deferred Beam DataFrame while pandas returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 5,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam import dataframe\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, full] where\n",
+        "# full contains all of the dataset (around 1000000 samples)\n",
+        "\n",
+        "source_csv_file = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(source_csv_file, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are countless objects in outer space, and some of them are closer than we think. A distance of 70,000 km might seem too large to pose any danger, but at an astronomical scale this is a very small distance, and an object passing that close can disrupt many natural phenomena. \n",
+        "\n",
+        "These objects/asteroids can thus prove to be harmful, so it is wise to know what surrounds us and which of these objects could harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types."
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 3,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "f5386993-14cb-42ee-94ca-8ea006860d3e"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 3
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 8,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 378
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "e49b4243-107f-4256-9e09-49cc20bf7f56"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_f6af66571c53daa0d9052370b7d1d8b7\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_f6af66571c53daa0d9052370b7d1d8b7\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_f6af66571c53daa0d9052370b7d1d8b7\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "             spk_id  absolute_magnitude     diameter       albedo  \\\n",
+              "count  9.999000e+03         9999.000000  8688.000000  8672.000000   \n",
+              "mean   2.005000e+06           12.675380    19.245446     0.197723   \n",
+              "std    2.886607e+03            1.639609    30.190191     0.138819   \n",
+              "min    2.000001e+06            3.000000     0.300000     0.008000   \n",
+              "25%    2.002500e+06           11.900000     5.614000     0.074000   \n",
+              "50%    2.005000e+06           12.900000     9.814000     0.187000   \n",
+              "75%    2.007500e+06           13.700000    19.156750     0.283000   \n",
+              "max    2.009999e+06           20.700000   939.400000     1.000000   \n",
+              "\n",
+              "       diameter_sigma  eccentricity  inclination      moid_ld  \\\n",
+              "count     8591.000000   9999.000000  9999.000000  9999.000000   \n",
+              "mean         0.454072      0.148716     7.890742   509.805237   \n",
+              "std          1.093676      0.083803     6.336244   205.046582   \n",
+              "min          0.006000      0.001003     0.042716     0.131028   \n",
+              "25%          0.120000      0.093780     3.220137   377.829197   \n",
+              "50%          0.201000      0.140335     6.018836   470.650523   \n",
+              "75%          0.375000      0.187092    10.918176   636.010802   \n",
+              "max         39.297000      0.889831    68.018875  4241.524913   \n",
+              "\n",
+              "       semi_major_axis_au_unit  \n",
+              "count              9999.000000  \n",
+              "mean                  2.689836  \n",
+              "std                   0.607190  \n",
+              "min                   0.832048  \n",
+              "25%                   2.340816  \n",
+              "50%                   2.614468  \n",
+              "75%                   3.005449  \n",
+              "max                  24.667968  "
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-2c0349a9-81c4-473a-9fa1-44c423244858\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>count</th>\n",
+              "      <td>9.999000e+03</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>8688.000000</td>\n",
+              "      <td>8672.000000</td>\n",
+              "      <td>8591.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>mean</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.675380</td>\n",
+              "      <td>19.245446</td>\n",
+              "      <td>0.197723</td>\n",
+              "      <td>0.454072</td>\n",
+              "      <td>0.148716</td>\n",
+              "      <td>7.890742</td>\n",
+              "      <td>509.805237</td>\n",
+              "      <td>2.689836</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>std</th>\n",
+              "      <td>2.886607e+03</td>\n",
+              "      <td>1.639609</td>\n",
+              "      <td>30.190191</td>\n",
+              "      <td>0.138819</td>\n",
+              "      <td>1.093676</td>\n",
+              "      <td>0.083803</td>\n",
+              "      <td>6.336244</td>\n",
+              "      <td>205.046582</td>\n",
+              "      <td>0.607190</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>min</th>\n",
+              "      <td>2.000001e+06</td>\n",
+              "      <td>3.000000</td>\n",
+              "      <td>0.300000</td>\n",
+              "      <td>0.008000</td>\n",
+              "      <td>0.006000</td>\n",
+              "      <td>0.001003</td>\n",
+              "      <td>0.042716</td>\n",
+              "      <td>0.131028</td>\n",
+              "      <td>0.832048</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>25%</th>\n",
+              "      <td>2.002500e+06</td>\n",
+              "      <td>11.900000</td>\n",
+              "      <td>5.614000</td>\n",
+              "      <td>0.074000</td>\n",
+              "      <td>0.120000</td>\n",
+              "      <td>0.093780</td>\n",
+              "      <td>3.220137</td>\n",
+              "      <td>377.829197</td>\n",
+              "      <td>2.340816</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>50%</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.900000</td>\n",
+              "      <td>9.814000</td>\n",
+              "      <td>0.187000</td>\n",
+              "      <td>0.201000</td>\n",
+              "      <td>0.140335</td>\n",
+              "      <td>6.018836</td>\n",
+              "      <td>470.650523</td>\n",
+              "      <td>2.614468</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>75%</th>\n",
+              "      <td>2.007500e+06</td>\n",
+              "      <td>13.700000</td>\n",
+              "      <td>19.156750</td>\n",
+              "      <td>0.283000</td>\n",
+              "      <td>0.375000</td>\n",
+              "      <td>0.187092</td>\n",
+              "      <td>10.918176</td>\n",
+              "      <td>636.010802</td>\n",
+              "      <td>3.005449</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>max</th>\n",
+              "      <td>2.009999e+06</td>\n",
+              "      <td>20.700000</td>\n",
+              "      <td>939.400000</td>\n",
+              "      <td>1.000000</td>\n",
+              "      <td>39.297000</td>\n",
+              "      <td>0.889831</td>\n",
+              "      <td>68.018875</td>\n",
+              "      <td>4241.524913</td>\n",
+              "      <td>24.667968</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-2c0349a9-81c4-473a-9fa1-44c423244858')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-2c0349a9-81c4-473a-9fa1-44c423244858 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-2c0349a9-81c4-473a-9fa1-44c423244858');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://developers.google.com/machine-learning/data-prep/transform/normalization) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://developers.google.com/machine-learning/data-prep/transform/transform-categorical) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can also use the standard pandas command `DataFrame.describe()` to generate descriptive statistics for the numerical columns, such as percentiles, mean, and standard deviation. "
+      ],
+      "metadata": {
+        "id": "MGAErO0lAYws"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "with dataframe.allow_non_parallel_operations():\n",
+        "  beam_df_description = ib.collect(beam_df.describe())\n",
+        "\n",
+        "beam_df_description"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 378
+        },
+        "id": "Befv697VBGM7",
+        "outputId": "d02b7a41-a8a3-4837-cf63-e1fa9e7b011e"
+      },
+      "execution_count": 9,
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a5b31481d153dff1b7ecdd673624949b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a5b31481d153dff1b7ecdd673624949b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a5b31481d153dff1b7ecdd673624949b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "             spk_id  absolute_magnitude     diameter       albedo  \\\n",
+              "count  9.999000e+03         9999.000000  8688.000000  8672.000000   \n",
+              "mean   2.005000e+06           12.675380    19.245446     0.197723   \n",
+              "std    2.886607e+03            1.639609    30.190191     0.138819   \n",
+              "min    2.000001e+06            3.000000     0.300000     0.008000   \n",
+              "25%    2.002500e+06           11.900000     5.614000     0.074000   \n",
+              "50%    2.005000e+06           12.900000     9.814000     0.187000   \n",
+              "75%    2.007500e+06           13.700000    19.156750     0.283000   \n",
+              "max    2.009999e+06           20.700000   939.400000     1.000000   \n",
+              "\n",
+              "       diameter_sigma  eccentricity  inclination      moid_ld  \\\n",
+              "count     8591.000000   9999.000000  9999.000000  9999.000000   \n",
+              "mean         0.454072      0.148716     7.890742   509.805237   \n",
+              "std          1.093676      0.083803     6.336244   205.046582   \n",
+              "min          0.006000      0.001003     0.042716     0.131028   \n",
+              "25%          0.120000      0.093780     3.220137   377.829197   \n",
+              "50%          0.201000      0.140335     6.018836   470.650523   \n",
+              "75%          0.375000      0.187092    10.918176   636.010802   \n",
+              "max         39.297000      0.889831    68.018875  4241.524913   \n",
+              "\n",
+              "       semi_major_axis_au_unit  \n",
+              "count              9999.000000  \n",
+              "mean                  2.689836  \n",
+              "std                   0.607190  \n",
+              "min                   0.832048  \n",
+              "25%                   2.340816  \n",
+              "50%                   2.614468  \n",
+              "75%                   3.005449  \n",
+              "max                  24.667968  "
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-d16cf806-a3e2-46d9-973d-74448570aaa2\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>count</th>\n",
+              "      <td>9.999000e+03</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>8688.000000</td>\n",
+              "      <td>8672.000000</td>\n",
+              "      <td>8591.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "      <td>9999.000000</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>mean</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.675380</td>\n",
+              "      <td>19.245446</td>\n",
+              "      <td>0.197723</td>\n",
+              "      <td>0.454072</td>\n",
+              "      <td>0.148716</td>\n",
+              "      <td>7.890742</td>\n",
+              "      <td>509.805237</td>\n",
+              "      <td>2.689836</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>std</th>\n",
+              "      <td>2.886607e+03</td>\n",
+              "      <td>1.639609</td>\n",
+              "      <td>30.190191</td>\n",
+              "      <td>0.138819</td>\n",
+              "      <td>1.093676</td>\n",
+              "      <td>0.083803</td>\n",
+              "      <td>6.336244</td>\n",
+              "      <td>205.046582</td>\n",
+              "      <td>0.607190</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>min</th>\n",
+              "      <td>2.000001e+06</td>\n",
+              "      <td>3.000000</td>\n",
+              "      <td>0.300000</td>\n",
+              "      <td>0.008000</td>\n",
+              "      <td>0.006000</td>\n",
+              "      <td>0.001003</td>\n",
+              "      <td>0.042716</td>\n",
+              "      <td>0.131028</td>\n",
+              "      <td>0.832048</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>25%</th>\n",
+              "      <td>2.002500e+06</td>\n",
+              "      <td>11.900000</td>\n",
+              "      <td>5.614000</td>\n",
+              "      <td>0.074000</td>\n",
+              "      <td>0.120000</td>\n",
+              "      <td>0.093780</td>\n",
+              "      <td>3.220137</td>\n",
+              "      <td>377.829197</td>\n",
+              "      <td>2.340816</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>50%</th>\n",
+              "      <td>2.005000e+06</td>\n",
+              "      <td>12.900000</td>\n",
+              "      <td>9.814000</td>\n",
+              "      <td>0.187000</td>\n",
+              "      <td>0.201000</td>\n",
+              "      <td>0.140335</td>\n",
+              "      <td>6.018836</td>\n",
+              "      <td>470.650523</td>\n",
+              "      <td>2.614468</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>75%</th>\n",
+              "      <td>2.007500e+06</td>\n",
+              "      <td>13.700000</td>\n",
+              "      <td>19.156750</td>\n",
+              "      <td>0.283000</td>\n",
+              "      <td>0.375000</td>\n",
+              "      <td>0.187092</td>\n",
+              "      <td>10.918176</td>\n",
+              "      <td>636.010802</td>\n",
+              "      <td>3.005449</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>max</th>\n",
+              "      <td>2.009999e+06</td>\n",
+              "      <td>20.700000</td>\n",
+              "      <td>939.400000</td>\n",
+              "      <td>1.000000</td>\n",
+              "      <td>39.297000</td>\n",
+              "      <td>0.889831</td>\n",
+              "      <td>68.018875</td>\n",
+              "      <td>4241.524913</td>\n",
+              "      <td>24.667968</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-d16cf806-a3e2-46d9-973d-74448570aaa2')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-d16cf806-a3e2-46d9-973d-74448570aaa2 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-d16cf806-a3e2-46d9-973d-74448570aaa2');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 9
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check if all the columns need to be used for model training. Let's first have a look at the column description as provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the visual magnitude an asteroid would have if it were located 1 au from both the Sun and the observer, at zero phase angle.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km\n",
+        "* **albedo:** a measure of the diffuse reflection of solar radiation out of the total solar radiation, measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in object diameter, in km.\n",
+        "* **eccentricity:** value between 0 and 1 that describes how far the asteroid's orbit deviates from a circle (0 is a perfect circle)\n",
+        "* **inclination:** angle with respect to x-y ecliptic plane\n",
+        "* **moid_ld:** Earth minimum orbit intersection distance (MOID), in lunar distances (LD)\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** the length of half of the long axis of the orbit, in au\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. They only identify an asteroid, so they can be removed: they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's have a look at the percentage of missing values in each column"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_02cba1067de8e024374a26297834d233\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most of the columns have no missing values. However, roughly 13% to 14% of the values in columns **'diameter'**, **'albedo'** and **'diameter_sigma'** are missing. Since the missing values cannot be measured or derived, we drop these three columns; they will not be used to train the machine learning model."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "Next, we need to normalize the numerical columns before using them to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation (a.k.a. [z-score](https://developers.google.com/machine-learning/data-prep/transform/normalization#z-score)). This improves the performance and stability of the model during training and inference.\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical columns and the categorical columns"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 356
+        },
+        "id": "Gjc0UlDD-xUn",
+        "outputId": "cadc4402-7edc-43f6-bc6a-bc4f1fd3314a"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "error",
+          "ename": "NotImplementedError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mNotImplementedError\u001b[0m                       Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/987840581.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Normalizing method_1: Can work but relies on ticket\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001
 b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frame_base.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    426\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    427\u001b[0m     raise NotImplementedError(\n\u001b[0;32m--> 428\u001b[0;31m         \u001b[0;34mf\"{op!r} is not implemented yet. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    429\u001b[0m         \u001b[0;34mf\"If support for {op!r} is important to you, please let the Beam \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    430\u001b[0m         \u001b[0;34m\"community know by writing to user@beam.apache.org \"\u001b[0m\u001b[0;34m\u
 001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mNotImplementedError\u001b[0m: 'loc.setitem' is not implemented yet. If support for 'loc.setitem' is important to you, please let the Beam community know by writing to user@beam.apache.org (see https://beam.apache.org/community/contact-us/) or commenting on https://github.com/apache/beam/issues/20318"
+          ]
+        }
+      ],
+      "source": [
+        "# Normalizing method_1: Can work but relies on ticket #22267\n",
+        "beam_df.loc[:, numerical_cols] = (beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean()) / beam_df.loc[:, numerical_cols].std()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
+        "Normalizing the data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "be40308f-d27c-46fa-e365-43a635addf6b"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>306</th>\n",
+              "      <td>-1.570727</td>\n",
+              "      <td>-0.062543</td>\n",
+              "      <td>-0.278518</td>\n",
+              "      <td>0.373194</td>\n",
+              "      <td>0.357201</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>310</th>\n",
+              "      <td>-1.631718</td>\n",
+              "      <td>-1.724526</td>\n",
+              "      <td>-0.736389</td>\n",
+              "      <td>1.087833</td>\n",
+              "      <td>0.344233</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>546</th>\n",
+              "      <td>-1.753698</td>\n",
+              "      <td>1.028793</td>\n",
+              "      <td>1.415303</td>\n",
+              "      <td>-0.339489</td>\n",
+              "      <td>0.139080</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>635</th>\n",
+              "      <td>-1.875678</td>\n",
+              "      <td>0.244869</td>\n",
+              "      <td>0.005905</td>\n",
+              "      <td>0.214107</td>\n",
+              "      <td>0.367559</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>701</th>\n",
+              "      <td>-3.278451</td>\n",
+              "      <td>-1.570523</td>\n",
+              "      <td>2.006145</td>\n",
+              "      <td>1.542754</td>\n",
+              "      <td>0.829337</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9697</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-1.151809</td>\n",
+              "      <td>-0.082944</td>\n",
+              "      <td>-0.129556</td>\n",
+              "      <td>-0.533538</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9813</th>\n",
+              "      <td>1.722740</td>\n",
+              "      <td>0.844551</td>\n",
+              "      <td>-0.583247</td>\n",
+              "      <td>-1.006447</td>\n",
+              "      <td>-0.677961</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9868</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-0.207399</td>\n",
+              "      <td>-0.784665</td>\n",
+              "      <td>-0.462136</td>\n",
+              "      <td>-0.539794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9903</th>\n",
+              "      <td>0.868878</td>\n",
+              "      <td>0.460086</td>\n",
+              "      <td>0.092258</td>\n",
+              "      <td>-0.107597</td>\n",
+              "      <td>0.071794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9956</th>\n",
+              "      <td>0.746898</td>\n",
+              "      <td>-0.234132</td>\n",
+              "      <td>-0.161116</td>\n",
+              "      <td>-0.601379</td>\n",
+              "      <td>-0.664887</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 5 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_numerical_cols = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize the dataframe that contains only the numerical columns\n",
+        "beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean())/beam_numerical_cols.std()\n",
+        "\n",
+        "ib.collect(beam_numerical_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to convert the categorical columns into one-hot encoded variables so that they can be used during training.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "Ngoxg0rSywVd",
+        "outputId": "d81bb29a-f8f8-4186-b186-5ff85667dbec"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1671644751.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'object_class'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'get_dummies'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col= beam_df.filter(items=['object_class'])\n",
+        "object_class_col.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "hz8s7z8caTq-",
+        "outputId": "5c543a6b-0ea1-41f8-afec-9691bbbd1f5b"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1927971370.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;31m# df['categories_concat'].str.get_dummies('-')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'str'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col.str.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "rVdSIyCB0spw"
+      },
+      "source": [
+        "# Putting it all together\n",
+        "\n",
+        "Let's now summarize all the steps executed above into a full pipeline implementation and visualize our preprocessed data.\n",
+        "\n",
+        "> ℹ️ Note that the only standard Beam construct used here is the `Pipeline` instance. The remaining preprocessing commands are all native pandas methods that are supported by the Beam DataFrame API."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "ndaSNond0v8Q",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 498
+        },
+        "outputId": "0155d359-45c9-4345-e1b6-b1881408f049"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "error",
+          "ename": "ValueError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1408061827.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m     25\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     26\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 27\u001b[0;31m \u001b[0mib\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcollect\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpreprocessed_dataset\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/utils.py\u001b[0m in \u001b[0;36mrun_within_progress_indicator\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    275\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mrun_within_progress_indicator\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    276\u001b[0m     \u001b[0;32mwith\u001b[0m \u001b[0mProgressIndicator\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34mf'Processing... {func.__name__}'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Done.'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 277\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001
 b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    278\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    279\u001b[0m   \u001b[0;32mreturn\u001b[0m \u001b[0mrun_within_progress_indicator\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/interactive_beam.py\u001b[0m in \u001b[0;36mcollect\u001b[0;34m(pcoll, n, duration, include_window_info)\u001b[0m\n\u001b[1;32m    945\u001b[0m         element_type=element_type)\n\u001b[1;32m    946\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 947\u001b[0;31m   \u001b[0mrecording\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mrecording_manager\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrecord\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mpcoll\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmax_n\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mn\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmax_duration\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mduration\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    948\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    949\u001b[0m   \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0
 ;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/recording_manager.py\u001b[0m in \u001b[0;36mrecord\u001b[0;34m(self, pcolls, max_n, max_duration)\u001b[0m\n\u001b[1;32m    459\u001b[0m       pf.PipelineFragment(\n\u001b[1;32m    460\u001b[0m           \u001b[0mlist\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0muncomputed_pcolls\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 461\u001b[0;31m           self.user_pipeline.options).run(blocking=is_remote_run)\n\u001b[0m\u001b[1;32m    462\u001b[0m       \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mie\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcurrent_env\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpipeline_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0muser_pipeline\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n
 \u001b[1;32m    463\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py\u001b[0m in \u001b[0;36mrun\u001b[0;34m(self, display_pipeline_graph, use_cache, blocking)\u001b[0m\n\u001b[1;32m    111\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_force_compute\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0muse_cache\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    112\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_blocking\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mblocking\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 113\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdeduce_fragment\u001
 b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    114\u001b[0m     \u001b[0;32mfinally\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    115\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_skip_display\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpreserved_skip_display\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py\u001b[0m in \u001b[0;36mdeduce_fragment\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m     98\u001b[0m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mto_runner_api\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     99\u001b[0m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 100\u001b[0;31m         self._options)\n\u001b[0m\u001b[1;32m    101\u001b[0m     \u001b[0mie\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcurrent_env\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0madd_derived_pipeline\u001b[0m\u001b[0;34m(\u001b[0m\u001b[
 0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfragment\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    102\u001b[0m     \u001b[0;32mreturn\u001b[0m \u001b[0mfragment\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/pipeline.py\u001b[0m in \u001b[0;36mfrom_runner_api\u001b[0;34m(proto, runner, options, return_context)\u001b[0m\n\u001b[1;32m    990\u001b[0m       \u001b[0mpcollection\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpipeline\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mp\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    991\u001b[0m       \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mpcollection\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mproducer\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 992\u001b[0;31m         \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'No producer for %s'\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0mid\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    993\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    994\u001b[0m 
     \u001b[0;31m# Inject PBegin input where necessary.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mValueError\u001b[0m: No producer for ref_PCollection_PCollection_265"
+          ]
+        }
+      ],
+      "source": [
+        "# Specify the location of source csv file to be processed\n",
+        "source_csv_file = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(source_csv_file, splittable=True)\n",
+        "\n",
+        "# Drop irrelevant columns/columns with missing values\n",
+        "beam_df = beam_df.drop(['spk_id', 'full_name','diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "\n",
+        "# Get numerical columns/columns with categorical variables\n",
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))\n",
+        "\n",
+        "# Normalize the numerical variables \n",
+        "beam_df_numerical = beam_df.filter(items=numerical_cols)\n",
+        "beam_df_numerical = (beam_df_numerical - beam_df_numerical.mean())/beam_df_numerical.std()\n",
+        "\n",
+        "\n",
+        "# One-hot encode the categorical variables \n",
+        "beam_df_categorical = beam_df.filter(items=categorical_cols)\n",
+        "# TODO: one-hot encoding step\n",
+        "\n",
+        "# Merge the normalized variables with the one-hot encoded variables\n",
+        "preprocessed_dataset = beam_df_categorical.merge(beam_df_numerical, left_index = True, right_index = True)\n",
+        "\n",
+        "ib.collect(preprocessed_dataset)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xZvJTqa3XKI_"
+      },
+      "source": [
+        "# Part II : Process the full dataset with the Distributed Runner\n",
+        "Now that we've shown how to build and execute the pipeline locally using the Interactive Runner, it's time to execute our pipeline on the full dataset by switching to a distributed runner. For this example, we will execute our pipeline on [Dataflow](https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline)."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "PROJECT_ID = \"<my-gcp-project>\"\n",
+        "REGION = \"us-west1\"\n",
+        "TEMP_DIR = \"gs://<my-bucket>/tmp\"\n",
+        "OUTPUT_DIR = \"gs://<my-bucket>/dataframe-result\""
+      ],
+      "metadata": {
+        "id": "dDBYbMEWbL4t"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "> ℹ️ Note that we are now processing the full dataset `full.csv`, which contains approximately 1 million rows. We're also writing the results to a `csv` file instead of using `ib.collect()` to materialize the deferred dataframe.\n",
+        "\n",
+        "> ℹ️ The only thing we need to change to switch from the Interactive Runner to a distributed runner is the pipeline options. The remaining pipeline steps are identical."
+      ],
+      "metadata": {
+        "id": "Qk1GaYoSc9-1"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Specify the location of source csv file to be processed (full dataset)\n",
+        "source_csv_file = 'gs://apache-beam-samples/nasa_jpl_asteroid/full.csv'\n",
+        "\n",
+        "# Build a new pipeline that will execute on Dataflow.\n",
+        "p = beam.Pipeline(DataflowRunner(),\n",
+        "                  options=beam.options.pipeline_options.PipelineOptions(\n",
+        "                      project=PROJECT_ID,\n",
+        "                      region=REGION,\n",
+        "                      temp_location=TEMP_DIR,\n",
+        "                      # Disable autoscaling for a quicker demo\n",
+        "                      autoscaling_algorithm='NONE',\n",
+        "                      num_workers=10))\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(source_csv_file, splittable=True)\n",
+        "\n",
+        "# Drop irrelevant columns/columns with missing values\n",
+        "beam_df = beam_df.drop(['spk_id', 'full_name','diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "\n",
+        "# Get numerical columns/columns with categorical variables\n",
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))\n",
+        "\n",
+        "# Normalize the numerical variables \n",
+        "beam_df_numerical = beam_df.filter(items=numerical_cols)\n",
+        "beam_df_numerical = (beam_df_numerical - beam_df_numerical.mean())/beam_df_numerical.std()\n",
+        "\n",
+        "# One-hot encode the categorical variables \n",
+        "beam_df_categorical = beam_df.filter(items=categorical_cols)\n",
+        "# TODO: one-hot encoding step\n",
+        "\n",
+        "# Merge the normalized variables with the one-hot encoded variables (Optional)\n",
+        "preprocessed_dataset = beam_df_categorical.merge(beam_df_numerical, left_index = True, right_index = True)\n",
+        "\n",
+        "# Write the pre-processed dataset to csv\n",
+        "preprocessed_dataset.to_csv(os.path.join(OUTPUT_DIR, \"preprocessed_data.csv\"))"
+      ],
+      "metadata": {
+        "id": "1XovR0gKbMlK"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Let's now submit and execute our pipeline."
+      ],
+      "metadata": {
+        "id": "a789u4Yecs_g"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "p.run().wait_until_finish()"
+      ],
+      "metadata": {
+        "id": "pbUlC102bPaZ"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "The pipeline job takes some time to finish."
+      ],
+      "metadata": {
+        "id": "dzdqmzKzTOng"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# What's next \n",
+        "\n",
+        "Now that we've seen how to analyze and preprocess a large-scale dataset with the Beam DataFrames API, we can train a classification model on our preprocessed dataset.  \n",
+        "\n",
+        "To learn more on how to get started with classifying structured data, refer to:\n",
+        "\n",
+        "*   [Classify structured data with feature columns](https://www.tensorflow.org/tutorials/structured_data/feature_columns)\n",

Review Comment:
   good to know, i'll remove the resource
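
For reference, here is what the normalization step and the still-TODO one-hot encoding step in the cell quoted above compute. This is only a minimal sketch in plain Python with made-up sample values. It uses neither Beam nor pandas, and the helper names `zscore` and `one_hot` are illustrative, not part of any API.

```python
import statistics


def zscore(values):
    """Scale values to zero mean and unit sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample std (ddof=1), the pandas default
    return [(v - mean) / std for v in values]


def one_hot(values):
    """Map each value to a dict of 0/1 indicators, one per category."""
    categories = sorted(set(values))
    return [{c: int(v == c) for c in categories} for v in values]


# Made-up sample values, for illustration only.
inclination = [2.0, 4.0, 6.0]
object_class = ['MBA', 'APO', 'MBA']

normalized = zscore(inclination)
encoded = one_hot(object_class)
```

Note that `one_hot` needs the full set of categories before it can emit a single row, which is presumably why this step is harder to express on a deferred DataFrame.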





[GitHub] [beam] PhilippeMoussalli commented on pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
PhilippeMoussalli commented on PR #22587:
URL: https://github.com/apache/beam/pull/22587#issuecomment-1246471292

   > @PhilippeMoussalli Could you please take a look at #23069?
   
   @KevinGG I just tested it out and it checks out! Thanks again for taking this up. 




[GitHub] [beam] KevinGG commented on pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
KevinGG commented on PR #22587:
URL: https://github.com/apache/beam/pull/22587#issuecomment-1206694956

   > I think the "No producer" error is a bug with PCollection pruning in interactive beam: #21430
   > 
   > CC: @KevinGG
   
   Commented in https://github.com/apache/beam/issues/21430, we can disable pruning for dataframe like what we did for TestStream: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py#L219
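
To illustrate the failure mode discussed above, here is a toy model in plain Python. It does not use Beam's actual data structures or pruning code, and `Transform`, `prune`, and `check_producers` are hypothetical names. The assumption it demonstrates: if pruning drops a transform while a kept transform still consumes its output, that output is left without a producer, which is the condition `from_runner_api` rejects with `No producer for ...`.

```python
from dataclasses import dataclass, field


@dataclass
class Transform:
    """Toy stand-in for a pipeline transform (hypothetical, not a Beam class)."""
    name: str
    inputs: list = field(default_factory=list)   # PCollection ids consumed
    outputs: list = field(default_factory=list)  # PCollection ids produced


def prune(transforms, keep):
    """Naive pruning: keep only the named transforms, ignoring dependencies."""
    return [t for t in transforms if t.name in keep]


def check_producers(transforms):
    """Raise if any consumed PCollection has no producing transform."""
    produced = {pc for t in transforms for pc in t.outputs}
    for t in transforms:
        for pc in t.inputs:
            if pc not in produced:
                raise ValueError('No producer for %s' % pc)


pipeline = [
    Transform('read_csv', outputs=['pc_rows']),
    Transform('mean', inputs=['pc_rows'], outputs=['pc_mean']),
    Transform('normalize', inputs=['pc_rows', 'pc_mean'], outputs=['pc_norm']),
]

check_producers(pipeline)  # the full graph is consistent

# Dropping 'mean' while keeping its consumer leaves 'pc_mean' orphaned.
pruned = prune(pipeline, keep={'read_csv', 'normalize'})
try:
    check_producers(pruned)
except ValueError as err:
    error_message = str(err)
```

In this toy model, keeping every transform (the analogue of disabling pruning) avoids the error because no consumed PCollection can lose its producer.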




[GitHub] [beam] TheNeuralBit commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r959797852


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. They provide a familiar interface for machine learning practitioners to build complex data-processing pipelines by only invoking standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to quickly test our pipeline locally before running it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we will only be working with a subset of the original dataset since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam function returns a deferred Beam DataFrame, while the pandas function returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, sample] where\n",
+        "# sample contains all of the dataset (around 1000000 samples)\n",
+        "file_location = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(file_location, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. A distance of 70,000 km may not seem close enough to harm us, but on an astronomical scale it is very small and can disrupt many natural phenomena. \n",
+        "\n",
+        "These objects/asteroids can thus prove to be harmful, so it is wise to know what surrounds us and which of these objects can harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types."
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "c24ff83d-3a13-47a6-c9c2-3978729fde82"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 4
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 746
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "14fa80de-2dee-4963-99d8-3e321f949ff8"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a986c6cc61ed5a5b622e163d92f73775\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "       spk_id                   full_name near_earth_object  \\\n",
+              "0     2000001                     1 Ceres                 N   \n",
+              "1     2000002                    2 Pallas                 N   \n",
+              "2     2000003                      3 Juno                 N   \n",
+              "3     2000004                     4 Vesta                 N   \n",
+              "4     2000005                   5 Astraea                 N   \n",
+              "...       ...                         ...               ...   \n",
+              "9994  2009995    9995 Alouette (4805 P-L)                 N   \n",
+              "9995  2009996         9996 ANS (9070 P-L)                 N   \n",
+              "9996  2009997        9997 COBE (1217 T-1)                 N   \n",
+              "9997  2009998         9998 ISO (1293 T-1)                 N   \n",
+              "9998  2009999       9999 Wiles (4196 T-2)                 N   \n",
+              "\n",
+              "      absolute_magnitude  diameter  albedo  diameter_sigma  eccentricity  \\\n",
+              "0                   3.40   939.400  0.0900           0.200      0.076009   \n",
+              "1                   4.20   545.000  0.1010          18.000      0.229972   \n",
+              "2                   5.33   246.596  0.2140          10.594      0.256936   \n",
+              "3                   3.00   525.400  0.4228           0.200      0.088721   \n",
+              "4                   6.90   106.699  0.2740           3.140      0.190913   \n",
+              "...                  ...       ...     ...             ...           ...   \n",
+              "9994               15.10     2.564  0.2450           0.550      0.160610   \n",
+              "9995               13.60     8.978  0.1130           0.376      0.235174   \n",
+              "9996               14.30       NaN     NaN             NaN      0.113059   \n",
+              "9997               15.10     2.235  0.3880           0.373      0.093852   \n",
+              "9998               13.00     7.148  0.2620           0.065      0.071351   \n",
+              "\n",
+              "      inclination     moid_ld object_class  semi_major_axis_au_unit  \\\n",
+              "0       10.594067  620.640533          MBA                 2.769165   \n",
+              "1       34.832932  480.348639          MBA                 2.773841   \n",
+              "2       12.991043  402.514639          MBA                 2.668285   \n",
+              "3        7.141771  443.451432          MBA                 2.361418   \n",
+              "4        5.367427  426.433027          MBA                 2.574037   \n",
+              "...           ...         ...          ...                      ...   \n",
+              "9994     2.311731  388.723233          MBA                 2.390249   \n",
+              "9995     7.657713  444.194746          MBA                 2.796605   \n",
+              "9996     2.459643  495.460110          MBA                 2.545674   \n",
+              "9997     3.912263  373.848377          MBA                 2.160961   \n",
+              "9998     3.198839  632.144398          MBA                 2.839917   \n",
+              "\n",
+              "     hazardous_flag  \n",
+              "0                 N  \n",
+              "1                 N  \n",
+              "2                 N  \n",
+              "3                 N  \n",
+              "4                 N  \n",
+              "...             ...  \n",
+              "9994              N  \n",
+              "9995              N  \n",
+              "9996              N  \n",
+              "9997              N  \n",
+              "9998              N  \n",
+              "\n",
+              "[9999 rows x 13 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-0161aa69-d50f-4d6f-84c1-10dacb278880\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>full_name</th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>2000001</td>\n",
+              "      <td>1 Ceres</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>939.400</td>\n",
+              "      <td>0.0900</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>2000002</td>\n",
+              "      <td>2 Pallas</td>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>545.000</td>\n",
+              "      <td>0.1010</td>\n",
+              "      <td>18.000</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>2000003</td>\n",
+              "      <td>3 Juno</td>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>246.596</td>\n",
+              "      <td>0.2140</td>\n",
+              "      <td>10.594</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>2000004</td>\n",
+              "      <td>4 Vesta</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>525.400</td>\n",
+              "      <td>0.4228</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>2000005</td>\n",
+              "      <td>5 Astraea</td>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>106.699</td>\n",
+              "      <td>0.2740</td>\n",
+              "      <td>3.140</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>2009995</td>\n",
+              "      <td>9995 Alouette (4805 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.564</td>\n",
+              "      <td>0.2450</td>\n",
+              "      <td>0.550</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>2009996</td>\n",
+              "      <td>9996 ANS (9070 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>8.978</td>\n",
+              "      <td>0.1130</td>\n",
+              "      <td>0.376</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>2009997</td>\n",
+              "      <td>9997 COBE (1217 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>2009998</td>\n",
+              "      <td>9998 ISO (1293 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.235</td>\n",
+              "      <td>0.3880</td>\n",
+              "      <td>0.373</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>2009999</td>\n",
+              "      <td>9999 Wiles (4196 T-2)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>7.148</td>\n",
+              "      <td>0.2620</td>\n",
+              "      <td>0.065</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 13 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-0161aa69-d50f-4d6f-84c1-10dacb278880')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 5
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check whether all the columns can be used for model training. Let's first have a look at the column descriptions as provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the apparent magnitude an object would have if it were located at a distance of 10 parsecs.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km.\n",
+        "* **albedo:** a measure of the diffuse reflection of solar radiation out of the total solar radiation, measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in object diameter, in km.\n",
+        "* **eccentricity:** value between 0 and 1 that refers to how much the asteroid's orbit deviates from a perfect circle.\n",
+        "* **inclination:** angle of the orbit with respect to the x-y ecliptic plane.\n",
+        "* **moid_ld:** Earth Minimum Orbit Intersection Distance, in lunar distance (LD) units.\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** half the length of the long axis of the orbit, in au.\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. These columns can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's have a look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_02cba1067de8e024374a26297834d233\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most columns have no missing values. Columns **'diameter'**, **'albedo'** and **'diameter_sigma'** have many missing values. Since these values cannot be measured or derived, we can remove these columns; they will not be required for training the machine learning model."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "The numerical columns need to be normalized before using them to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation. This ensures that all the columns have the same scale and are weighted equally during training."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical columns and the categorical columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 356
+        },
+        "id": "Gjc0UlDD-xUn",
+        "outputId": "cadc4402-7edc-43f6-bc6a-bc4f1fd3314a"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "error",
+          "ename": "NotImplementedError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mNotImplementedError\u001b[0m                       Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/987840581.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Normalizing method_1: Can work but relies on ticket\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001
 b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frame_base.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    426\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    427\u001b[0m     raise NotImplementedError(\n\u001b[0;32m--> 428\u001b[0;31m         \u001b[0;34mf\"{op!r} is not implemented yet. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    429\u001b[0m         \u001b[0;34mf\"If support for {op!r} is important to you, please let the Beam \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    430\u001b[0m         \u001b[0;34m\"community know by writing to user@beam.apache.org \"\u001b[0m\u001b[0;34m\u
 001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mNotImplementedError\u001b[0m: 'loc.setitem' is not implemented yet. If support for 'loc.setitem' is important to you, please let the Beam community know by writing to user@beam.apache.org (see https://beam.apache.org/community/contact-us/) or commenting on https://github.com/apache/beam/issues/20318"
+          ]
+        }
+      ],
+      "source": [
+        "# Normalizing method_1: Can work but relies on ticket #22267\n",
+        "beam_df.loc[:,numerical_cols] = (beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean())/beam_df.loc[:, numerical_cols].std()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
+        "Normalizing the data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "be40308f-d27c-46fa-e365-43a635addf6b"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>306</th>\n",
+              "      <td>-1.570727</td>\n",
+              "      <td>-0.062543</td>\n",
+              "      <td>-0.278518</td>\n",
+              "      <td>0.373194</td>\n",
+              "      <td>0.357201</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>310</th>\n",
+              "      <td>-1.631718</td>\n",
+              "      <td>-1.724526</td>\n",
+              "      <td>-0.736389</td>\n",
+              "      <td>1.087833</td>\n",
+              "      <td>0.344233</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>546</th>\n",
+              "      <td>-1.753698</td>\n",
+              "      <td>1.028793</td>\n",
+              "      <td>1.415303</td>\n",
+              "      <td>-0.339489</td>\n",
+              "      <td>0.139080</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>635</th>\n",
+              "      <td>-1.875678</td>\n",
+              "      <td>0.244869</td>\n",
+              "      <td>0.005905</td>\n",
+              "      <td>0.214107</td>\n",
+              "      <td>0.367559</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>701</th>\n",
+              "      <td>-3.278451</td>\n",
+              "      <td>-1.570523</td>\n",
+              "      <td>2.006145</td>\n",
+              "      <td>1.542754</td>\n",
+              "      <td>0.829337</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9697</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-1.151809</td>\n",
+              "      <td>-0.082944</td>\n",
+              "      <td>-0.129556</td>\n",
+              "      <td>-0.533538</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9813</th>\n",
+              "      <td>1.722740</td>\n",
+              "      <td>0.844551</td>\n",
+              "      <td>-0.583247</td>\n",
+              "      <td>-1.006447</td>\n",
+              "      <td>-0.677961</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9868</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-0.207399</td>\n",
+              "      <td>-0.784665</td>\n",
+              "      <td>-0.462136</td>\n",
+              "      <td>-0.539794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9903</th>\n",
+              "      <td>0.868878</td>\n",
+              "      <td>0.460086</td>\n",
+              "      <td>0.092258</td>\n",
+              "      <td>-0.107597</td>\n",
+              "      <td>0.071794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9956</th>\n",
+              "      <td>0.746898</td>\n",
+              "      <td>-0.234132</td>\n",
+              "      <td>-0.161116</td>\n",
+              "      <td>-0.601379</td>\n",
+              "      <td>-0.664887</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 5 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_numerical_cols = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize the DataFrame that contains only the numerical columns\n",
+        "beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean())/beam_numerical_cols.std()\n",
+        "\n",
+        "ib.collect(beam_numerical_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to convert the categorical columns into one-hot encoded variables to use them during training. \n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "Ngoxg0rSywVd",
+        "outputId": "d81bb29a-f8f8-4186-b186-5ff85667dbec"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1671644751.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'object_class'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'get_dummies'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col= beam_df.filter(items=['object_class'])\n",
+        "object_class_col.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "hz8s7z8caTq-",
+        "outputId": "5c543a6b-0ea1-41f8-afec-9691bbbd1f5b"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1927971370.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;31m# df['categories_concat'].str.get_dummies('-')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'str'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col.str.get_dummies()"

Review Comment:
   Did you try `unique` on just the series that you are trying to one-hot encode? Note that even for `pd.get_dummies`, the user is going to have to create a `CategoricalDtype` somehow, which will likely utilize `unique`, so we need to get that working regardless.
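
   As a rough illustration of that point, here is a minimal sketch in plain pandas (not the Beam DataFrame API). The sample values and the variable names `object_class`, `categories`, and `cat_dtype` are assumptions made up for the example; only `unique`, `pd.CategoricalDtype`, and `pd.get_dummies` come from the discussion above.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `object_class` column.
object_class = pd.Series(["APO", "AMO", "APO", "ATE"], name="object_class")

# Step 1: discover the distinct categories up front (the `unique` call).
categories = sorted(object_class.unique())

# Step 2: pin the column to a fixed CategoricalDtype so that the set of
# output columns is known before the values themselves are processed.
cat_dtype = pd.CategoricalDtype(categories=categories)

# Step 3: one-hot encode the categorical series.
encoded = pd.get_dummies(object_class.astype(cat_dtype), dtype=int)
print(encoded)
```

   The idea is that once the categories are fixed in the dtype, the shape of the one-hot encoded result no longer depends on the data, which is presumably what a deferred implementation would need.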





[GitHub] [beam] TheNeuralBit commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r959818228


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. It provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by only invoking standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to quickly test our pipeline locally before running it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we will only be working with a subset of the original dataset since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam method returns a deferred Beam DataFrame while pandas returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, sample] where\n",
+        "# sample contains all of the dataset (around 1000000 samples)\n",
+        "file_location = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(file_location, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "Outer space contains countless objects, and some of them are closer than we think. A distance of 70,000 km may not seem potentially harmful, but on an astronomical scale it is very small and can disrupt many natural phenomena.\n",
+        "\n",
+        "These objects and asteroids can therefore prove harmful, so it is wise to know what surrounds us and which of those objects could harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types:"
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "c24ff83d-3a13-47a6-c9c2-3978729fde82"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 4
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a Pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 746
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "14fa80de-2dee-4963-99d8-3e321f949ff8"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "       spk_id                   full_name near_earth_object  \\\n",
+              "0     2000001                     1 Ceres                 N   \n",
+              "1     2000002                    2 Pallas                 N   \n",
+              "2     2000003                      3 Juno                 N   \n",
+              "3     2000004                     4 Vesta                 N   \n",
+              "4     2000005                   5 Astraea                 N   \n",
+              "...       ...                         ...               ...   \n",
+              "9994  2009995    9995 Alouette (4805 P-L)                 N   \n",
+              "9995  2009996         9996 ANS (9070 P-L)                 N   \n",
+              "9996  2009997        9997 COBE (1217 T-1)                 N   \n",
+              "9997  2009998         9998 ISO (1293 T-1)                 N   \n",
+              "9998  2009999       9999 Wiles (4196 T-2)                 N   \n",
+              "\n",
+              "      absolute_magnitude  diameter  albedo  diameter_sigma  eccentricity  \\\n",
+              "0                   3.40   939.400  0.0900           0.200      0.076009   \n",
+              "1                   4.20   545.000  0.1010          18.000      0.229972   \n",
+              "2                   5.33   246.596  0.2140          10.594      0.256936   \n",
+              "3                   3.00   525.400  0.4228           0.200      0.088721   \n",
+              "4                   6.90   106.699  0.2740           3.140      0.190913   \n",
+              "...                  ...       ...     ...             ...           ...   \n",
+              "9994               15.10     2.564  0.2450           0.550      0.160610   \n",
+              "9995               13.60     8.978  0.1130           0.376      0.235174   \n",
+              "9996               14.30       NaN     NaN             NaN      0.113059   \n",
+              "9997               15.10     2.235  0.3880           0.373      0.093852   \n",
+              "9998               13.00     7.148  0.2620           0.065      0.071351   \n",
+              "\n",
+              "      inclination     moid_ld object_class  semi_major_axis_au_unit  \\\n",
+              "0       10.594067  620.640533          MBA                 2.769165   \n",
+              "1       34.832932  480.348639          MBA                 2.773841   \n",
+              "2       12.991043  402.514639          MBA                 2.668285   \n",
+              "3        7.141771  443.451432          MBA                 2.361418   \n",
+              "4        5.367427  426.433027          MBA                 2.574037   \n",
+              "...           ...         ...          ...                      ...   \n",
+              "9994     2.311731  388.723233          MBA                 2.390249   \n",
+              "9995     7.657713  444.194746          MBA                 2.796605   \n",
+              "9996     2.459643  495.460110          MBA                 2.545674   \n",
+              "9997     3.912263  373.848377          MBA                 2.160961   \n",
+              "9998     3.198839  632.144398          MBA                 2.839917   \n",
+              "\n",
+              "     hazardous_flag  \n",
+              "0                 N  \n",
+              "1                 N  \n",
+              "2                 N  \n",
+              "3                 N  \n",
+              "4                 N  \n",
+              "...             ...  \n",
+              "9994              N  \n",
+              "9995              N  \n",
+              "9996              N  \n",
+              "9997              N  \n",
+              "9998              N  \n",
+              "\n",
+              "[9999 rows x 13 columns]"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 5
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check whether all the columns can be used for model training. Let's first have a look at the column descriptions as provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the apparent magnitude an object would have if it were located at a distance of 10 parsecs.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km\n",
+        "* **albedo:** a measure of the diffuse reflection of solar radiation out of the total solar radiation and measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in object diameter, in km\n",
+        "* **eccentricity:** value between 0 and 1 that describes how much the asteroid's orbit deviates from a perfect circle\n",
+        "* **inclination:** angle of the orbit with respect to the x-y ecliptic plane\n",
+        "* **moid_ld:** Earth Minimum Orbit Intersection Distance, in lunar distances (LD)\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** the length of half of the long axis of the orbit, in au\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. These columns can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's have a look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_02cba1067de8e024374a26297834d233\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most columns have no missing values. Columns **'diameter'**, **'albedo'** and **'diameter_sigma'** have many missing values. Since these values cannot be measured or derived, we can remove these columns; they will not be required for training the machine learning model."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "The numerical columns need to be normalized before they are used to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation, so that each column has zero mean and unit variance. This puts all columns on the same scale so that they are weighted comparably during training."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical columns and the categorical columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 356
+        },
+        "id": "Gjc0UlDD-xUn",
+        "outputId": "cadc4402-7edc-43f6-bc6a-bc4f1fd3314a"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "error",
+          "ename": "NotImplementedError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mNotImplementedError\u001b[0m                       Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/987840581.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Normalizing method_1: Can work but relies on ticket\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001
 b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frame_base.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    426\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    427\u001b[0m     raise NotImplementedError(\n\u001b[0;32m--> 428\u001b[0;31m         \u001b[0;34mf\"{op!r} is not implemented yet. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    429\u001b[0m         \u001b[0;34mf\"If support for {op!r} is important to you, please let the Beam \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    430\u001b[0m         \u001b[0;34m\"community know by writing to user@beam.apache.org \"\u001b[0m\u001b[0;34m\u
 001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mNotImplementedError\u001b[0m: 'loc.setitem' is not implemented yet. If support for 'loc.setitem' is important to you, please let the Beam community know by writing to user@beam.apache.org (see https://beam.apache.org/community/contact-us/) or commenting on https://github.com/apache/beam/issues/20318"
+          ]
+        }
+      ],
+      "source": [
+        "# Normalizing method_1: Can work but relies on ticket #22267\n",
+        "beam_df.loc[:, numerical_cols] = (beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean()) / beam_df.loc[:, numerical_cols].std()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
+        "Normalizing the data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "be40308f-d27c-46fa-e365-43a635addf6b"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>306</th>\n",
+              "      <td>-1.570727</td>\n",
+              "      <td>-0.062543</td>\n",
+              "      <td>-0.278518</td>\n",
+              "      <td>0.373194</td>\n",
+              "      <td>0.357201</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>310</th>\n",
+              "      <td>-1.631718</td>\n",
+              "      <td>-1.724526</td>\n",
+              "      <td>-0.736389</td>\n",
+              "      <td>1.087833</td>\n",
+              "      <td>0.344233</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>546</th>\n",
+              "      <td>-1.753698</td>\n",
+              "      <td>1.028793</td>\n",
+              "      <td>1.415303</td>\n",
+              "      <td>-0.339489</td>\n",
+              "      <td>0.139080</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>635</th>\n",
+              "      <td>-1.875678</td>\n",
+              "      <td>0.244869</td>\n",
+              "      <td>0.005905</td>\n",
+              "      <td>0.214107</td>\n",
+              "      <td>0.367559</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>701</th>\n",
+              "      <td>-3.278451</td>\n",
+              "      <td>-1.570523</td>\n",
+              "      <td>2.006145</td>\n",
+              "      <td>1.542754</td>\n",
+              "      <td>0.829337</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9697</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-1.151809</td>\n",
+              "      <td>-0.082944</td>\n",
+              "      <td>-0.129556</td>\n",
+              "      <td>-0.533538</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9813</th>\n",
+              "      <td>1.722740</td>\n",
+              "      <td>0.844551</td>\n",
+              "      <td>-0.583247</td>\n",
+              "      <td>-1.006447</td>\n",
+              "      <td>-0.677961</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9868</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-0.207399</td>\n",
+              "      <td>-0.784665</td>\n",
+              "      <td>-0.462136</td>\n",
+              "      <td>-0.539794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9903</th>\n",
+              "      <td>0.868878</td>\n",
+              "      <td>0.460086</td>\n",
+              "      <td>0.092258</td>\n",
+              "      <td>-0.107597</td>\n",
+              "      <td>0.071794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9956</th>\n",
+              "      <td>0.746898</td>\n",
+              "      <td>-0.234132</td>\n",
+              "      <td>-0.161116</td>\n",
+              "      <td>-0.601379</td>\n",
+              "      <td>-0.664887</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 5 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_numerical_cols = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize only the numerical columns\n",
+        "beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean())/beam_numerical_cols.std()\n",
+        "\n",
+        "ib.collect(beam_numerical_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to convert the categorical columns into one-hot encoded variables to use them during training.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "Ngoxg0rSywVd",
+        "outputId": "d81bb29a-f8f8-4186-b186-5ff85667dbec"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1671644751.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'object_class'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'get_dummies'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col= beam_df.filter(items=['object_class'])\n",
+        "object_class_col.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "hz8s7z8caTq-",
+        "outputId": "5c543a6b-0ea1-41f8-afec-9691bbbd1f5b"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1927971370.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;31m# df['categories_concat'].str.get_dummies('-')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'str'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col.str.get_dummies()"

Review Comment:
   Also it looks like the pandas `DataFrame` doesn't actually have a `unique` operation. If the Beam Deferred DataFrame is giving "unique() is not implemented for deferred Dataframe" then that's a bug. It should give an `AttributeError` like pandas does.
   
   To be sure, I just checked and we do give the same error as pandas:
   ```
   In [7]: beam_df.unique
   ---------------------------------------------------------------------------
   AttributeError                            Traceback (most recent call last)
   Input In [7], in <cell line: 1>()
   ----> 1 beam_df.unique
   
   File ~/working_dir/beam/sdks/python/apache_beam/dataframe/frames.py:2484, in DeferredDataFrame.__getattr__(self, name)
      2482   return self[name]
      2483 else:
   -> 2484   return object.__getattribute__(self, name)
   
   AttributeError: 'DeferredDataFrame' object has no attribute 'unique'
   ```
   
   If a user wants the unique values in a `DataFrame` they can use `drop_duplicates`, as suggested in https://stackoverflow.com/questions/43184491/df-unique-on-whole-dataframe-based-on-a-column.
   
   `drop_duplicates` (with `keep='any'`) should work in Beam.
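
As a rough illustration of the suggestion above, here is a minimal pandas sketch. The toy data and variable names are hypothetical; only the column names are borrowed from the notebook. It checks that `unique` exists on a `Series` but not on a `DataFrame`, and uses `drop_duplicates` to get the distinct values of one column. pandas itself only accepts `'first'`, `'last'` or `False` for `keep`, so the `keep='any'` variant mentioned above for Beam is not exercised here.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the notebook's dataset.
df = pd.DataFrame({
    "spk_id": [2000001, 2000002, 2000003, 2000004],
    "object_class": ["MBA", "MBA", "APO", "MBA"],
})

# `unique` is a Series method; pandas does not define it on DataFrame.
has_series_unique = hasattr(df["object_class"], "unique")
has_frame_unique = hasattr(df, "unique")

# Distinct values of one column, kept as a single-column DataFrame.
unique_classes = df[["object_class"]].drop_duplicates()
class_list = sorted(unique_classes["object_class"].tolist())

print(has_series_unique, has_frame_unique, class_list)
```

On a deferred Beam DataFrame the equivalent call would presumably be `beam_df[['object_class']].drop_duplicates(keep='any')`, per the comment above, followed by `ib.collect(...)` to inspect the result.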





[GitHub] [beam] TheNeuralBit commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r959827941


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. It provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by invoking only standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to quickly test our pipeline locally before running it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we will only be working with a subset of the original dataset since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam method returns a deferred Beam DataFrame while pandas returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, sample] where\n",
+        "# sample contains all of the dataset (around 1000000 samples)\n",
+        "file_location = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize the pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(file_location, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. A distance of 70,000 km may not seem potentially harmful, but at an astronomical scale it is very small and can disrupt many natural phenomena.\n",
+        "\n",
+        "These objects/asteroids can thus prove to be harmful, so it is wise to know what surrounds us and which of these objects can harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types"
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "c24ff83d-3a13-47a6-c9c2-3978729fde82"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 4
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a Pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 746
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "14fa80de-2dee-4963-99d8-3e321f949ff8"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a986c6cc61ed5a5b622e163d92f73775\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "       spk_id                   full_name near_earth_object  \\\n",
+              "0     2000001                     1 Ceres                 N   \n",
+              "1     2000002                    2 Pallas                 N   \n",
+              "2     2000003                      3 Juno                 N   \n",
+              "3     2000004                     4 Vesta                 N   \n",
+              "4     2000005                   5 Astraea                 N   \n",
+              "...       ...                         ...               ...   \n",
+              "9994  2009995    9995 Alouette (4805 P-L)                 N   \n",
+              "9995  2009996         9996 ANS (9070 P-L)                 N   \n",
+              "9996  2009997        9997 COBE (1217 T-1)                 N   \n",
+              "9997  2009998         9998 ISO (1293 T-1)                 N   \n",
+              "9998  2009999       9999 Wiles (4196 T-2)                 N   \n",
+              "\n",
+              "      absolute_magnitude  diameter  albedo  diameter_sigma  eccentricity  \\\n",
+              "0                   3.40   939.400  0.0900           0.200      0.076009   \n",
+              "1                   4.20   545.000  0.1010          18.000      0.229972   \n",
+              "2                   5.33   246.596  0.2140          10.594      0.256936   \n",
+              "3                   3.00   525.400  0.4228           0.200      0.088721   \n",
+              "4                   6.90   106.699  0.2740           3.140      0.190913   \n",
+              "...                  ...       ...     ...             ...           ...   \n",
+              "9994               15.10     2.564  0.2450           0.550      0.160610   \n",
+              "9995               13.60     8.978  0.1130           0.376      0.235174   \n",
+              "9996               14.30       NaN     NaN             NaN      0.113059   \n",
+              "9997               15.10     2.235  0.3880           0.373      0.093852   \n",
+              "9998               13.00     7.148  0.2620           0.065      0.071351   \n",
+              "\n",
+              "      inclination     moid_ld object_class  semi_major_axis_au_unit  \\\n",
+              "0       10.594067  620.640533          MBA                 2.769165   \n",
+              "1       34.832932  480.348639          MBA                 2.773841   \n",
+              "2       12.991043  402.514639          MBA                 2.668285   \n",
+              "3        7.141771  443.451432          MBA                 2.361418   \n",
+              "4        5.367427  426.433027          MBA                 2.574037   \n",
+              "...           ...         ...          ...                      ...   \n",
+              "9994     2.311731  388.723233          MBA                 2.390249   \n",
+              "9995     7.657713  444.194746          MBA                 2.796605   \n",
+              "9996     2.459643  495.460110          MBA                 2.545674   \n",
+              "9997     3.912263  373.848377          MBA                 2.160961   \n",
+              "9998     3.198839  632.144398          MBA                 2.839917   \n",
+              "\n",
+              "     hazardous_flag  \n",
+              "0                 N  \n",
+              "1                 N  \n",
+              "2                 N  \n",
+              "3                 N  \n",
+              "4                 N  \n",
+              "...             ...  \n",
+              "9994              N  \n",
+              "9995              N  \n",
+              "9996              N  \n",
+              "9997              N  \n",
+              "9998              N  \n",
+              "\n",
+              "[9999 rows x 13 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-0161aa69-d50f-4d6f-84c1-10dacb278880\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>full_name</th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>2000001</td>\n",
+              "      <td>1 Ceres</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>939.400</td>\n",
+              "      <td>0.0900</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>2000002</td>\n",
+              "      <td>2 Pallas</td>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>545.000</td>\n",
+              "      <td>0.1010</td>\n",
+              "      <td>18.000</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>2000003</td>\n",
+              "      <td>3 Juno</td>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>246.596</td>\n",
+              "      <td>0.2140</td>\n",
+              "      <td>10.594</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>2000004</td>\n",
+              "      <td>4 Vesta</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>525.400</td>\n",
+              "      <td>0.4228</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>2000005</td>\n",
+              "      <td>5 Astraea</td>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>106.699</td>\n",
+              "      <td>0.2740</td>\n",
+              "      <td>3.140</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>2009995</td>\n",
+              "      <td>9995 Alouette (4805 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.564</td>\n",
+              "      <td>0.2450</td>\n",
+              "      <td>0.550</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>2009996</td>\n",
+              "      <td>9996 ANS (9070 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>8.978</td>\n",
+              "      <td>0.1130</td>\n",
+              "      <td>0.376</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>2009997</td>\n",
+              "      <td>9997 COBE (1217 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>2009998</td>\n",
+              "      <td>9998 ISO (1293 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.235</td>\n",
+              "      <td>0.3880</td>\n",
+              "      <td>0.373</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>2009999</td>\n",
+              "      <td>9999 Wiles (4196 T-2)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>7.148</td>\n",
+              "      <td>0.2620</td>\n",
+              "      <td>0.065</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 13 columns</p>\n",
+              "</div>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 5
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** These columns need to be transformed with [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) before they can be used during training.\n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check whether all the columns can be used for model training. Let's first have a look at the column descriptions as provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the visual magnitude an observer would record if the asteroid were placed 1 au from both the Sun and the observer, at zero phase angle.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km\n",
+        "* **albedo:** a measure of the diffuse reflection of solar radiation out of the total solar radiation, measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in object diameter, in km\n",
+        "* **eccentricity:** value between 0 and 1 that refers to how flat or round the asteroid's orbit is\n",
+        "* **inclination:** angle with respect to x-y ecliptic plane\n",
+        "* **moid_ld:** Earth minimum orbit intersection distance (MOID), in lunar distances (LD)\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** the length of half of the long axis of the orbit, in au\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. These columns can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's have a look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most columns have no missing values. Columns **'diameter'**, **'albedo'** and **'diameter_sigma'** have many missing values (roughly 13-14% each). Since these values cannot be measured or derived, we can remove these columns, as they will not be required for training a machine learning model."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "The numerical columns need to be normalized before they are used to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation. This puts all columns on the same scale so that they are weighted equally during training."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical columns and the categorical columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 356
+        },
+        "id": "Gjc0UlDD-xUn",
+        "outputId": "cadc4402-7edc-43f6-bc6a-bc4f1fd3314a"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "error",
+          "ename": "NotImplementedError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mNotImplementedError\u001b[0m                       Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/987840581.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Normalizing method_1: Can work but relies on ticket\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001
 b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frame_base.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    426\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    427\u001b[0m     raise NotImplementedError(\n\u001b[0;32m--> 428\u001b[0;31m         \u001b[0;34mf\"{op!r} is not implemented yet. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    429\u001b[0m         \u001b[0;34mf\"If support for {op!r} is important to you, please let the Beam \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    430\u001b[0m         \u001b[0;34m\"community know by writing to user@beam.apache.org \"\u001b[0m\u001b[0;34m\u
 001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mNotImplementedError\u001b[0m: 'loc.setitem' is not implemented yet. If support for 'loc.setitem' is important to you, please let the Beam community know by writing to user@beam.apache.org (see https://beam.apache.org/community/contact-us/) or commenting on https://github.com/apache/beam/issues/20318"
+          ]
+        }
+      ],
+      "source": [
+        "# Normalizing method_1: Can work but relies on ticket #22267\n",
+        "beam_df.loc[:, numerical_cols] = (beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean())/beam_df.loc[:, numerical_cols].std()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
+        "Normalizing the data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "be40308f-d27c-46fa-e365-43a635addf6b"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>306</th>\n",
+              "      <td>-1.570727</td>\n",
+              "      <td>-0.062543</td>\n",
+              "      <td>-0.278518</td>\n",
+              "      <td>0.373194</td>\n",
+              "      <td>0.357201</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>310</th>\n",
+              "      <td>-1.631718</td>\n",
+              "      <td>-1.724526</td>\n",
+              "      <td>-0.736389</td>\n",
+              "      <td>1.087833</td>\n",
+              "      <td>0.344233</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>546</th>\n",
+              "      <td>-1.753698</td>\n",
+              "      <td>1.028793</td>\n",
+              "      <td>1.415303</td>\n",
+              "      <td>-0.339489</td>\n",
+              "      <td>0.139080</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>635</th>\n",
+              "      <td>-1.875678</td>\n",
+              "      <td>0.244869</td>\n",
+              "      <td>0.005905</td>\n",
+              "      <td>0.214107</td>\n",
+              "      <td>0.367559</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>701</th>\n",
+              "      <td>-3.278451</td>\n",
+              "      <td>-1.570523</td>\n",
+              "      <td>2.006145</td>\n",
+              "      <td>1.542754</td>\n",
+              "      <td>0.829337</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9697</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-1.151809</td>\n",
+              "      <td>-0.082944</td>\n",
+              "      <td>-0.129556</td>\n",
+              "      <td>-0.533538</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9813</th>\n",
+              "      <td>1.722740</td>\n",
+              "      <td>0.844551</td>\n",
+              "      <td>-0.583247</td>\n",
+              "      <td>-1.006447</td>\n",
+              "      <td>-0.677961</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9868</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-0.207399</td>\n",
+              "      <td>-0.784665</td>\n",
+              "      <td>-0.462136</td>\n",
+              "      <td>-0.539794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9903</th>\n",
+              "      <td>0.868878</td>\n",
+              "      <td>0.460086</td>\n",
+              "      <td>0.092258</td>\n",
+              "      <td>-0.107597</td>\n",
+              "      <td>0.071794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9956</th>\n",
+              "      <td>0.746898</td>\n",
+              "      <td>-0.234132</td>\n",
+              "      <td>-0.161116</td>\n",
+              "      <td>-0.601379</td>\n",
+              "      <td>-0.664887</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 5 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_numerical_cols = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize the DataFrame that contains only the numerical columns\n",
+        "beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean())/beam_numerical_cols.std()\n",
+        "\n",
+        "ib.collect(beam_numerical_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to convert the categorical columns into one-hot encoded variables to use them during training.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "Ngoxg0rSywVd",
+        "outputId": "d81bb29a-f8f8-4186-b186-5ff85667dbec"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1671644751.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'object_class'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'get_dummies'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col = beam_df.filter(items=['object_class'])\n",
+        "object_class_col.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "hz8s7z8caTq-",
+        "outputId": "5c543a6b-0ea1-41f8-afec-9691bbbd1f5b"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1927971370.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;31m# df['categories_concat'].str.get_dummies('-')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'str'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col.str.get_dummies()"

Review Comment:
   For reference, https://github.com/apache/beam/issues/20959 is the relevant issue for supporting the CategoricalDtypes needed in `get_dummies`.
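
   To illustrate why the categories need to be declared up front: in a deferred DataFrame the output columns have to be known before any data is read, which is only possible when the set of categories is fixed in advance. The sketch below is plain Python, not the Beam or pandas API; `one_hot` and the category labels are hypothetical names chosen for illustration.

```python
from typing import Dict, List, Sequence


def one_hot(values: Sequence[str],
            categories: Sequence[str]) -> List[Dict[str, int]]:
    """One-hot encode values against a category list that is fixed up front."""
    known = set(categories)
    rows = []
    for value in values:
        if value not in known:
            # A value outside the declared categories would change the schema.
            raise ValueError(f"unknown category: {value!r}")
        # Every row gets the same keys, in the declared category order.
        rows.append({category: int(value == category) for category in categories})
    return rows


# Hypothetical category labels, declared before any data is read.
CATEGORIES = ["AMO", "APO", "ATE"]

encoded = one_hot(["APO", "AMO"], CATEGORIES)
columns = list(encoded[0].keys())
```

   Because the column names come from `CATEGORIES` rather than from the observed values, the output schema is the same no matter which rows happen to be processed.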





[GitHub] [beam] TheNeuralBit commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r972137927


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. It provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by only invoking standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install the latest version from source, which implements `df.mean()`\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to quickly test our pipeline locally before running it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we will only be working with a subset of the original dataset since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam method returns a deferred Beam DataFrame, whereas the pandas function returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, sample] where\n",
+        "# sample contains all of the dataset (around 1000000 samples)\n",
+        "file_location = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize the pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(file_location, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "Outer space contains countless objects, and some of them are closer than we think. A distance of 70,000 km may seem too large to pose any danger, but on an astronomical scale it is very small, and an object passing that close can disrupt many natural phenomena. \n",
+        "\n",
+        "These objects/asteroids can therefore prove harmful, so it is wise to know what surrounds us and which of these objects could harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types"
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "c24ff83d-3a13-47a6-c9c2-3978729fde82"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 4
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a Pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 746
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "14fa80de-2dee-4963-99d8-3e321f949ff8"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a986c6cc61ed5a5b622e163d92f73775\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "       spk_id                   full_name near_earth_object  \\\n",
+              "0     2000001                     1 Ceres                 N   \n",
+              "1     2000002                    2 Pallas                 N   \n",
+              "2     2000003                      3 Juno                 N   \n",
+              "3     2000004                     4 Vesta                 N   \n",
+              "4     2000005                   5 Astraea                 N   \n",
+              "...       ...                         ...               ...   \n",
+              "9994  2009995    9995 Alouette (4805 P-L)                 N   \n",
+              "9995  2009996         9996 ANS (9070 P-L)                 N   \n",
+              "9996  2009997        9997 COBE (1217 T-1)                 N   \n",
+              "9997  2009998         9998 ISO (1293 T-1)                 N   \n",
+              "9998  2009999       9999 Wiles (4196 T-2)                 N   \n",
+              "\n",
+              "      absolute_magnitude  diameter  albedo  diameter_sigma  eccentricity  \\\n",
+              "0                   3.40   939.400  0.0900           0.200      0.076009   \n",
+              "1                   4.20   545.000  0.1010          18.000      0.229972   \n",
+              "2                   5.33   246.596  0.2140          10.594      0.256936   \n",
+              "3                   3.00   525.400  0.4228           0.200      0.088721   \n",
+              "4                   6.90   106.699  0.2740           3.140      0.190913   \n",
+              "...                  ...       ...     ...             ...           ...   \n",
+              "9994               15.10     2.564  0.2450           0.550      0.160610   \n",
+              "9995               13.60     8.978  0.1130           0.376      0.235174   \n",
+              "9996               14.30       NaN     NaN             NaN      0.113059   \n",
+              "9997               15.10     2.235  0.3880           0.373      0.093852   \n",
+              "9998               13.00     7.148  0.2620           0.065      0.071351   \n",
+              "\n",
+              "      inclination     moid_ld object_class  semi_major_axis_au_unit  \\\n",
+              "0       10.594067  620.640533          MBA                 2.769165   \n",
+              "1       34.832932  480.348639          MBA                 2.773841   \n",
+              "2       12.991043  402.514639          MBA                 2.668285   \n",
+              "3        7.141771  443.451432          MBA                 2.361418   \n",
+              "4        5.367427  426.433027          MBA                 2.574037   \n",
+              "...           ...         ...          ...                      ...   \n",
+              "9994     2.311731  388.723233          MBA                 2.390249   \n",
+              "9995     7.657713  444.194746          MBA                 2.796605   \n",
+              "9996     2.459643  495.460110          MBA                 2.545674   \n",
+              "9997     3.912263  373.848377          MBA                 2.160961   \n",
+              "9998     3.198839  632.144398          MBA                 2.839917   \n",
+              "\n",
+              "     hazardous_flag  \n",
+              "0                 N  \n",
+              "1                 N  \n",
+              "2                 N  \n",
+              "3                 N  \n",
+              "4                 N  \n",
+              "...             ...  \n",
+              "9994              N  \n",
+              "9995              N  \n",
+              "9996              N  \n",
+              "9997              N  \n",
+              "9998              N  \n",
+              "\n",
+              "[9999 rows x 13 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-0161aa69-d50f-4d6f-84c1-10dacb278880\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>full_name</th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>2000001</td>\n",
+              "      <td>1 Ceres</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>939.400</td>\n",
+              "      <td>0.0900</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>2000002</td>\n",
+              "      <td>2 Pallas</td>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>545.000</td>\n",
+              "      <td>0.1010</td>\n",
+              "      <td>18.000</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>2000003</td>\n",
+              "      <td>3 Juno</td>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>246.596</td>\n",
+              "      <td>0.2140</td>\n",
+              "      <td>10.594</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>2000004</td>\n",
+              "      <td>4 Vesta</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>525.400</td>\n",
+              "      <td>0.4228</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>2000005</td>\n",
+              "      <td>5 Astraea</td>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>106.699</td>\n",
+              "      <td>0.2740</td>\n",
+              "      <td>3.140</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>2009995</td>\n",
+              "      <td>9995 Alouette (4805 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.564</td>\n",
+              "      <td>0.2450</td>\n",
+              "      <td>0.550</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>2009996</td>\n",
+              "      <td>9996 ANS (9070 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>8.978</td>\n",
+              "      <td>0.1130</td>\n",
+              "      <td>0.376</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>2009997</td>\n",
+              "      <td>9997 COBE (1217 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>2009998</td>\n",
+              "      <td>9998 ISO (1293 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.235</td>\n",
+              "      <td>0.3880</td>\n",
+              "      <td>0.373</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>2009999</td>\n",
+              "      <td>9999 Wiles (4196 T-2)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>7.148</td>\n",
+              "      <td>0.2620</td>\n",
+              "      <td>0.065</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 13 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-0161aa69-d50f-4d6f-84c1-10dacb278880')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 5
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
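+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "As a rough illustration of these two transformations, the sketch below applies them with plain pandas to a small made-up frame. It uses z-score standardization as one common form of normalization. The toy values and all `toy_*` names are assumptions for illustration only; this is not the Beam DataFrame implementation used later in this notebook."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "import pandas as pd\n",
+        "\n",
+        "# Toy data (made-up values) with one numerical and one categorical column.\n",
+        "toy_df = pd.DataFrame({\n",
+        "    'eccentricity': [0.1, 0.2, 0.3],\n",
+        "    'object_class': ['MBA', 'APO', 'MBA'],\n",
+        "})\n",
+        "\n",
+        "# Standardize the numerical column: subtract the mean, divide by the standard deviation.\n",
+        "toy_numerical = toy_df[['eccentricity']]\n",
+        "toy_normalized = (toy_numerical - toy_numerical.mean()) / toy_numerical.std()\n",
+        "\n",
+        "# One-hot encode the categorical column: one indicator column per category.\n",
+        "toy_one_hot = pd.get_dummies(toy_df['object_class'], prefix='object_class')\n",
+        "\n",
+        "# Combine both results into a single frame of model-ready features.\n",
+        "toy_processed = pd.concat([toy_normalized, toy_one_hot], axis=1)\n",
+        "toy_processed"
+      ]
+    },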
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check if all the columns can be used for model training. Let's first have a look at the column description as provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID\n",
+        "* **full_name:** Asteroid name\n",
+        "* **near_earth_object:** Near-earth object flag\n",
+        "* **absolute_magnitude:** the apparent magnitude an object would have if it were located 1 au from both the Sun and the observer, at zero phase angle.\n",
+        "* **diameter:** object diameter (from equivalent sphere), in km\n",
+        "* **albedo:** a measure of the diffuse reflection of solar radiation out of the total solar radiation and measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in object diameter, in km\n",
+        "* **eccentricity:** value between 0 and 1 that refers to how flat or round the orbit of the asteroid is\n",
+        "* **inclination:** angle of the orbit with respect to the x-y ecliptic plane\n",
+        "* **moid_ld:** Earth Minimum Orbit Intersection Distance, in lunar distance (LD) units\n",
+        "* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** the length of half of the long axis of the orbit, in au\n",
+        "* **hazardous_flag:** Hazardous Asteroid Flag"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. These columns can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's have a look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_02cba1067de8e024374a26297834d233\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most columns have no missing values. Columns **'diameter'**, **'albedo'** and **'diameter_sigma'** have many missing values. Because these values cannot be measured or derived, we can remove these columns; they will not be required for training the machine learning model."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "The numerical columns need to be normalized before using them to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation. This puts all numerical columns on the same scale so that they are weighted equally during training."
+      ]
+    },
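+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "As a sketch of the standardization described above (the symbols $x_{ij}$, $\\mu_j$, $\\sigma_j$ and $z_{ij}$ are notation assumed here for illustration, not names used elsewhere in this notebook), the value $x_{ij}$ of numerical column $j$ in row $i$ would be replaced by\n",
+        "\n",
+        "$$z_{ij} = \\frac{x_{ij} - \\mu_j}{\\sigma_j}$$\n",
+        "\n",
+        "where $\\mu_j$ and $\\sigma_j$ are the mean and standard deviation of column $j$. Each standardized column then has mean 0 and standard deviation 1, assuming $\\sigma_j \\neq 0$."
+      ]
+    },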
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical columns and the categorical columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 356
+        },
+        "id": "Gjc0UlDD-xUn",
+        "outputId": "cadc4402-7edc-43f6-bc6a-bc4f1fd3314a"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "error",
+          "ename": "NotImplementedError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mNotImplementedError\u001b[0m                       Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/987840581.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Normalizing method_1: Can work but relies on ticket\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frame_base.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    426\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    427\u001b[0m     raise NotImplementedError(\n\u001b[0;32m--> 428\u001b[0;31m         \u001b[0;34mf\"{op!r} is not implemented yet. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    429\u001b[0m         \u001b[0;34mf\"If support for {op!r} is important to you, please let the Beam \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    430\u001b[0m         \u001b[0;34m\"community know by writing to user@beam.apache.org \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mNotImplementedError\u001b[0m: 'loc.setitem' is not implemented yet. If support for 'loc.setitem' is important to you, please let the Beam community know by writing to user@beam.apache.org (see https://beam.apache.org/community/contact-us/) or commenting on https://github.com/apache/beam/issues/20318"
+          ]
+        }
+      ],
+      "source": [
+        "# Normalizing method_1: Can work but relies on ticket #22267\n",
+        "beam_df.loc[:,numerical_cols] = (beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean())/beam_df.loc[:, numerical_cols].std()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
+        "Normalizing the data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "be40308f-d27c-46fa-e365-43a635addf6b"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>306</th>\n",
+              "      <td>-1.570727</td>\n",
+              "      <td>-0.062543</td>\n",
+              "      <td>-0.278518</td>\n",
+              "      <td>0.373194</td>\n",
+              "      <td>0.357201</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>310</th>\n",
+              "      <td>-1.631718</td>\n",
+              "      <td>-1.724526</td>\n",
+              "      <td>-0.736389</td>\n",
+              "      <td>1.087833</td>\n",
+              "      <td>0.344233</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>546</th>\n",
+              "      <td>-1.753698</td>\n",
+              "      <td>1.028793</td>\n",
+              "      <td>1.415303</td>\n",
+              "      <td>-0.339489</td>\n",
+              "      <td>0.139080</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>635</th>\n",
+              "      <td>-1.875678</td>\n",
+              "      <td>0.244869</td>\n",
+              "      <td>0.005905</td>\n",
+              "      <td>0.214107</td>\n",
+              "      <td>0.367559</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>701</th>\n",
+              "      <td>-3.278451</td>\n",
+              "      <td>-1.570523</td>\n",
+              "      <td>2.006145</td>\n",
+              "      <td>1.542754</td>\n",
+              "      <td>0.829337</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9697</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-1.151809</td>\n",
+              "      <td>-0.082944</td>\n",
+              "      <td>-0.129556</td>\n",
+              "      <td>-0.533538</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9813</th>\n",
+              "      <td>1.722740</td>\n",
+              "      <td>0.844551</td>\n",
+              "      <td>-0.583247</td>\n",
+              "      <td>-1.006447</td>\n",
+              "      <td>-0.677961</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9868</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-0.207399</td>\n",
+              "      <td>-0.784665</td>\n",
+              "      <td>-0.462136</td>\n",
+              "      <td>-0.539794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9903</th>\n",
+              "      <td>0.868878</td>\n",
+              "      <td>0.460086</td>\n",
+              "      <td>0.092258</td>\n",
+              "      <td>-0.107597</td>\n",
+              "      <td>0.071794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9956</th>\n",
+              "      <td>0.746898</td>\n",
+              "      <td>-0.234132</td>\n",
+              "      <td>-0.161116</td>\n",
+              "      <td>-0.601379</td>\n",
+              "      <td>-0.664887</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 5 columns</p>\n",
+              "</div>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_numerical_cols = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize only the numerical columns\n",
+        "beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean())/beam_numerical_cols.std()\n",
+        "\n",
+        "ib.collect(beam_numerical_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to convert the categorical columns into one-hot encoded variables so that they can be used during training.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "Ngoxg0rSywVd",
+        "outputId": "d81bb29a-f8f8-4186-b186-5ff85667dbec"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1671644751.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'object_class'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'get_dummies'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col= beam_df.filter(items=['object_class'])\n",
+        "object_class_col.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "hz8s7z8caTq-",
+        "outputId": "5c543a6b-0ea1-41f8-afec-9691bbbd1f5b"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1927971370.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;31m# df['categories_concat'].str.get_dummies('-')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'str'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col.str.get_dummies()"

Review Comment:
   Hm, that's odd. This could be a bug. @yeandy, do we have a way to apply a categorical dtype to a column?
   
   I will play around with this some to see if I can find a way. 
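
For reference, here is a minimal plain-pandas sketch of what the two failing cells seem to be aiming for. It is not Beam code, and the toy frame and class labels are made up for illustration; only the `object_class` column name comes from the notebook. In pandas itself, `get_dummies` is a top-level function (`pd.get_dummies`) or a `Series.str` method rather than a `DataFrame` method, and `df.filter(items=[...])` returns a DataFrame, which has no `.str` accessor. Selecting the column by label gives a Series that can be cast to a categorical dtype first:

```python
import pandas as pd

# Toy stand-in for the asteroid data; the class labels are illustrative only.
df = pd.DataFrame({"object_class": ["MBA", "APO", "MBA", "AMO"]})

# filter(items=[...]) returns a DataFrame, which has no `.str` accessor.
as_frame = df.filter(items=["object_class"])
assert isinstance(as_frame, pd.DataFrame)

# Selecting by label returns a Series instead.
as_series = df["object_class"]

# Declare the categories up front so the output columns are known in advance.
class_dtype = pd.CategoricalDtype(categories=["AMO", "APO", "MBA"])
categorical = as_series.astype(class_dtype)

# One-hot encode: one indicator column per declared category.
one_hot = pd.get_dummies(categorical, prefix="object_class")
print(one_hot.columns.tolist())
```

Because the categories are fixed in the dtype, the set of output columns does not depend on the data. That is presumably the property a deferred implementation would need, although whether the same pattern works on a deferred Beam column is exactly the open question here.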





[GitHub] [beam] PhilippeMoussalli commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
PhilippeMoussalli commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r1005905353


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,2163 @@
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "source": [
+        "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License."
+      ],
+      "metadata": {
+        "id": "sARMhsXz8yR1",
+        "cellView": "form"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like\n",
+        "API to declare and define Beam processing pipelines. It provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by invoking only standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "Because we want to explore the elements within a `PCollection`, we install Apache Beam with the `interactive` component, which provides the Interactive runner. The DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to test our code interactively, building out the pipeline as we go before deploying it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we work with only a subset of the original dataset because we are limited to the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam function returns a deferred Beam DataFrame while pandas returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 5,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam import dataframe\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, full] where\n",
+        "# full contains all of the dataset (around 1,000,000 samples)\n",
+        "\n",
+        "source_csv_file = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(source_csv_file, splittable=True)\n"

Review Comment:
   I just read more about the `splittable` param, and it turns out it's not needed for this example. I'll leave it out.





[GitHub] [beam] PhilippeMoussalli commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
PhilippeMoussalli commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r1012845410


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,3517 @@
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "source": [
+        "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License."
+      ],
+      "metadata": {
+        "id": "sARMhsXz8yR1",
+        "cellView": "form"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like\n",
+        "API to declare and define Beam processing pipelines. It provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by invoking only standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Goal\n",
+        "The goal of this notebook is to explore a dataset and preprocess it for machine learning model training using the Beam DataFrames API.\n",
+        "\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "Because we want to explore the elements within a `PCollection`, we install Apache Beam with the `interactive` component, which provides the Interactive runner. The DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [

Review Comment:
   done! 





[GitHub] [beam] PhilippeMoussalli commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
PhilippeMoussalli commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r964473694


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. It provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by invoking only standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to quickly test our pipeline locally before running it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we work with only a subset of the original dataset because we are limited to the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam function returns a deferred Beam DataFrame while pandas returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, sample] where\n",
+        "# sample contains all of the dataset (around 1000000 samples)\n",
+        "file_location = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(file_location, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "There are innumerable objects in outer space, and some of them are closer than we think. A distance of 70,000 km may not seem potentially harmful, but at an astronomical scale it is very small and can disrupt many natural phenomena.\n",
+        "\n",
+        "These objects and asteroids can therefore prove harmful, so it is wise to know what surrounds us and which of them can harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest-Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types."
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "c24ff83d-3a13-47a6-c9c2-3978729fde82"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 4
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 746
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "14fa80de-2dee-4963-99d8-3e321f949ff8"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a986c6cc61ed5a5b622e163d92f73775\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "       spk_id                   full_name near_earth_object  \\\n",
+              "0     2000001                     1 Ceres                 N   \n",
+              "1     2000002                    2 Pallas                 N   \n",
+              "2     2000003                      3 Juno                 N   \n",
+              "3     2000004                     4 Vesta                 N   \n",
+              "4     2000005                   5 Astraea                 N   \n",
+              "...       ...                         ...               ...   \n",
+              "9994  2009995    9995 Alouette (4805 P-L)                 N   \n",
+              "9995  2009996         9996 ANS (9070 P-L)                 N   \n",
+              "9996  2009997        9997 COBE (1217 T-1)                 N   \n",
+              "9997  2009998         9998 ISO (1293 T-1)                 N   \n",
+              "9998  2009999       9999 Wiles (4196 T-2)                 N   \n",
+              "\n",
+              "      absolute_magnitude  diameter  albedo  diameter_sigma  eccentricity  \\\n",
+              "0                   3.40   939.400  0.0900           0.200      0.076009   \n",
+              "1                   4.20   545.000  0.1010          18.000      0.229972   \n",
+              "2                   5.33   246.596  0.2140          10.594      0.256936   \n",
+              "3                   3.00   525.400  0.4228           0.200      0.088721   \n",
+              "4                   6.90   106.699  0.2740           3.140      0.190913   \n",
+              "...                  ...       ...     ...             ...           ...   \n",
+              "9994               15.10     2.564  0.2450           0.550      0.160610   \n",
+              "9995               13.60     8.978  0.1130           0.376      0.235174   \n",
+              "9996               14.30       NaN     NaN             NaN      0.113059   \n",
+              "9997               15.10     2.235  0.3880           0.373      0.093852   \n",
+              "9998               13.00     7.148  0.2620           0.065      0.071351   \n",
+              "\n",
+              "      inclination     moid_ld object_class  semi_major_axis_au_unit  \\\n",
+              "0       10.594067  620.640533          MBA                 2.769165   \n",
+              "1       34.832932  480.348639          MBA                 2.773841   \n",
+              "2       12.991043  402.514639          MBA                 2.668285   \n",
+              "3        7.141771  443.451432          MBA                 2.361418   \n",
+              "4        5.367427  426.433027          MBA                 2.574037   \n",
+              "...           ...         ...          ...                      ...   \n",
+              "9994     2.311731  388.723233          MBA                 2.390249   \n",
+              "9995     7.657713  444.194746          MBA                 2.796605   \n",
+              "9996     2.459643  495.460110          MBA                 2.545674   \n",
+              "9997     3.912263  373.848377          MBA                 2.160961   \n",
+              "9998     3.198839  632.144398          MBA                 2.839917   \n",
+              "\n",
+              "     hazardous_flag  \n",
+              "0                 N  \n",
+              "1                 N  \n",
+              "2                 N  \n",
+              "3                 N  \n",
+              "4                 N  \n",
+              "...             ...  \n",
+              "9994              N  \n",
+              "9995              N  \n",
+              "9996              N  \n",
+              "9997              N  \n",
+              "9998              N  \n",
+              "\n",
+              "[9999 rows x 13 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-0161aa69-d50f-4d6f-84c1-10dacb278880\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>full_name</th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>2000001</td>\n",
+              "      <td>1 Ceres</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>939.400</td>\n",
+              "      <td>0.0900</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>2000002</td>\n",
+              "      <td>2 Pallas</td>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>545.000</td>\n",
+              "      <td>0.1010</td>\n",
+              "      <td>18.000</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>2000003</td>\n",
+              "      <td>3 Juno</td>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>246.596</td>\n",
+              "      <td>0.2140</td>\n",
+              "      <td>10.594</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>2000004</td>\n",
+              "      <td>4 Vesta</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>525.400</td>\n",
+              "      <td>0.4228</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>2000005</td>\n",
+              "      <td>5 Astraea</td>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>106.699</td>\n",
+              "      <td>0.2740</td>\n",
+              "      <td>3.140</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>2009995</td>\n",
+              "      <td>9995 Alouette (4805 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.564</td>\n",
+              "      <td>0.2450</td>\n",
+              "      <td>0.550</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>2009996</td>\n",
+              "      <td>9996 ANS (9070 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>8.978</td>\n",
+              "      <td>0.1130</td>\n",
+              "      <td>0.376</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>2009997</td>\n",
+              "      <td>9997 COBE (1217 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>2009998</td>\n",
+              "      <td>9998 ISO (1293 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.235</td>\n",
+              "      <td>0.3880</td>\n",
+              "      <td>0.373</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>2009999</td>\n",
+              "      <td>9999 Wiles (4196 T-2)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>7.148</td>\n",
+              "      <td>0.2620</td>\n",
+              "      <td>0.065</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 13 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-0161aa69-d50f-4d6f-84c1-10dacb278880')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 5
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check if all the columns can be used for model training. Let's first have a look at the column description as provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",

Review Comment:
   woops need to be used indeed :) nice catch!





[GitHub] [beam] PhilippeMoussalli commented on a diff in pull request #22587: WIP: Dataframe API ML preprocessing notebook

Posted by GitBox <gi...@apache.org>.
PhilippeMoussalli commented on code in PR #22587:
URL: https://github.com/apache/beam/pull/22587#discussion_r964646141


##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -0,0 +1,1907 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Overview\n",
+        "\n",
+        "One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. \n",
+        "\n",
+        "Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.\n",
+        "\n",
+        "\n",
+        "## Beam DataFrames\n",
+        "\n",
+        "\n",
+        "Beam DataFrames provide a pandas-like DataFrame\n",
+        "API to declare and define Beam processing pipelines. This API provides a familiar interface for machine learning practitioners to build complex data-processing pipelines by only invoking standard pandas commands.\n",
+        "\n",
+        "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+        "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
+        "\n",
+        "## Tutorial outline\n",
+        "\n",
+        "In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  \n",
+        "\n",
+        "*   Removing unwanted columns.\n",
+        "*   One-hot encoding categorical columns.\n",
+        "*   Normalizing numerical columns.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "iFZC1inKuUCy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation\n",
+        "\n",
+        "First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.\n"
+      ],
+      "metadata": {
+        "id": "A0f2HJ22D4lt"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pCjwrwNWnuqI"
+      },
+      "source": [
+        "**Option 1:** Install latest version with implemented df.mean()\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-OJC0Xn5Um-C"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/apache/beam.git\n",
+        "\n",
+        "!cd beam/sdks/python && pip3 install -r build-requirements.txt \n",
+        "\n",
+        "%pip install -e beam/sdks/python/.[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xfXzNzA1n3ZP"
+      },
+      "source": [
+        "**Option 2:** Install latest release version   \n",
+        "\n",
+        "**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)\n",
+        "\n",
+        "TODO: Remove this text later"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4xY7ECJZOuJj"
+      },
+      "outputs": [],
+      "source": [
+        "! pip install apache-beam[interactive,gcp]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part I : Local exploration with the Interactive Beam runner\n",
+        "We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.\n",
+        "This allows us to quickly test our pipeline locally before running it on a distributed runner. \n",
+        "\n",
+        "\n",
+        "> ℹ️ In this section, we only work with a subset of the original dataset, since we're only using the compute resources of the notebook instance.\n"
+      ],
+      "metadata": {
+        "id": "3NO6RgB7GkkE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5I3G094hoB1P"
+      },
+      "source": [
+        "# Loading the data\n",
+        "\n",
+        "Pandas has the\n",
+        "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+        "function to easily read CSV files into DataFrames.\n",
+        "We're using the Beam\n",
+        "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+        "function that emulates `pandas.read_csv`. The main difference between them is that the Beam function returns a deferred Beam DataFrame, while the pandas function returns a standard DataFrame.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "X3_OB9cAULav"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "\n",
+        "import numpy as np\n",
+        "import pandas as pd \n",
+        "import apache_beam as beam\n",
+        "import apache_beam.runners.interactive.interactive_beam as ib\n",
+        "from apache_beam.runners.interactive.interactive_runner import InteractiveRunner\n",
+        "from apache_beam.runners.dataflow import DataflowRunner\n",
+        "\n",
+        "# Available options: [sample_1000, sample_10000, sample_100000, sample] where\n",
+        "# sample contains all of the dataset (around 1000000 samples)\n",
+        "file_location = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'\n",
+        "\n",
+        "# Initialize pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv(file_location, splittable=True)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "paf7yf3YpCh8"
+      },
+      "source": [
+        "# Data pre-processing\n",
+        "\n",
+        "## Dataset description \n",
+        "\n",
+        "### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)\n",
+        "Outer space contains innumerable objects, and some of them are closer than we think. A distance of 70,000 km may not seem like it could harm us, but on an astronomical scale it is very small and can disrupt many natural phenomena.\n",
+        "\n",
+        "These objects/asteroids can therefore prove harmful, so it is wise to know what surrounds us and which of those objects can harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest Earth objects."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "Let's first inspect the columns of our dataset and their types."
+      ],
+      "metadata": {
+        "id": "cvAu5T0ENjuQ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LwW77ixE-pjR",
+        "outputId": "c24ff83d-3a13-47a6-c9c2-3978729fde82"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "spk_id                       int64\n",
+              "full_name                   object\n",
+              "near_earth_object           object\n",
+              "absolute_magnitude         float64\n",
+              "diameter                   float64\n",
+              "albedo                     float64\n",
+              "diameter_sigma             float64\n",
+              "eccentricity               float64\n",
+              "inclination                float64\n",
+              "moid_ld                    float64\n",
+              "object_class                object\n",
+              "semi_major_axis_au_unit    float64\n",
+              "hazardous_flag              object\n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 4
+        }
+      ],
+      "source": [
+        "beam_df.dtypes"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a Pandas DataFrame."
+      ],
+      "metadata": {
+        "id": "1Wa6fpbyQige"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 746
+        },
+        "id": "DPxkAmkpq4Xv",
+        "outputId": "14fa80de-2dee-4963-99d8-3e321f949ff8"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_a986c6cc61ed5a5b622e163d92f73775\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_a986c6cc61ed5a5b622e163d92f73775\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "       spk_id                   full_name near_earth_object  \\\n",
+              "0     2000001                     1 Ceres                 N   \n",
+              "1     2000002                    2 Pallas                 N   \n",
+              "2     2000003                      3 Juno                 N   \n",
+              "3     2000004                     4 Vesta                 N   \n",
+              "4     2000005                   5 Astraea                 N   \n",
+              "...       ...                         ...               ...   \n",
+              "9994  2009995    9995 Alouette (4805 P-L)                 N   \n",
+              "9995  2009996         9996 ANS (9070 P-L)                 N   \n",
+              "9996  2009997        9997 COBE (1217 T-1)                 N   \n",
+              "9997  2009998         9998 ISO (1293 T-1)                 N   \n",
+              "9998  2009999       9999 Wiles (4196 T-2)                 N   \n",
+              "\n",
+              "      absolute_magnitude  diameter  albedo  diameter_sigma  eccentricity  \\\n",
+              "0                   3.40   939.400  0.0900           0.200      0.076009   \n",
+              "1                   4.20   545.000  0.1010          18.000      0.229972   \n",
+              "2                   5.33   246.596  0.2140          10.594      0.256936   \n",
+              "3                   3.00   525.400  0.4228           0.200      0.088721   \n",
+              "4                   6.90   106.699  0.2740           3.140      0.190913   \n",
+              "...                  ...       ...     ...             ...           ...   \n",
+              "9994               15.10     2.564  0.2450           0.550      0.160610   \n",
+              "9995               13.60     8.978  0.1130           0.376      0.235174   \n",
+              "9996               14.30       NaN     NaN             NaN      0.113059   \n",
+              "9997               15.10     2.235  0.3880           0.373      0.093852   \n",
+              "9998               13.00     7.148  0.2620           0.065      0.071351   \n",
+              "\n",
+              "      inclination     moid_ld object_class  semi_major_axis_au_unit  \\\n",
+              "0       10.594067  620.640533          MBA                 2.769165   \n",
+              "1       34.832932  480.348639          MBA                 2.773841   \n",
+              "2       12.991043  402.514639          MBA                 2.668285   \n",
+              "3        7.141771  443.451432          MBA                 2.361418   \n",
+              "4        5.367427  426.433027          MBA                 2.574037   \n",
+              "...           ...         ...          ...                      ...   \n",
+              "9994     2.311731  388.723233          MBA                 2.390249   \n",
+              "9995     7.657713  444.194746          MBA                 2.796605   \n",
+              "9996     2.459643  495.460110          MBA                 2.545674   \n",
+              "9997     3.912263  373.848377          MBA                 2.160961   \n",
+              "9998     3.198839  632.144398          MBA                 2.839917   \n",
+              "\n",
+              "     hazardous_flag  \n",
+              "0                 N  \n",
+              "1                 N  \n",
+              "2                 N  \n",
+              "3                 N  \n",
+              "4                 N  \n",
+              "...             ...  \n",
+              "9994              N  \n",
+              "9995              N  \n",
+              "9996              N  \n",
+              "9997              N  \n",
+              "9998              N  \n",
+              "\n",
+              "[9999 rows x 13 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-0161aa69-d50f-4d6f-84c1-10dacb278880\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>spk_id</th>\n",
+              "      <th>full_name</th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>diameter</th>\n",
+              "      <th>albedo</th>\n",
+              "      <th>diameter_sigma</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>2000001</td>\n",
+              "      <td>1 Ceres</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>939.400</td>\n",
+              "      <td>0.0900</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>2000002</td>\n",
+              "      <td>2 Pallas</td>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>545.000</td>\n",
+              "      <td>0.1010</td>\n",
+              "      <td>18.000</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>2000003</td>\n",
+              "      <td>3 Juno</td>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>246.596</td>\n",
+              "      <td>0.2140</td>\n",
+              "      <td>10.594</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>2000004</td>\n",
+              "      <td>4 Vesta</td>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>525.400</td>\n",
+              "      <td>0.4228</td>\n",
+              "      <td>0.200</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>2000005</td>\n",
+              "      <td>5 Astraea</td>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>106.699</td>\n",
+              "      <td>0.2740</td>\n",
+              "      <td>3.140</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>2009995</td>\n",
+              "      <td>9995 Alouette (4805 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.564</td>\n",
+              "      <td>0.2450</td>\n",
+              "      <td>0.550</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>2009996</td>\n",
+              "      <td>9996 ANS (9070 P-L)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>8.978</td>\n",
+              "      <td>0.1130</td>\n",
+              "      <td>0.376</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>2009997</td>\n",
+              "      <td>9997 COBE (1217 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>2009998</td>\n",
+              "      <td>9998 ISO (1293 T-1)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>2.235</td>\n",
+              "      <td>0.3880</td>\n",
+              "      <td>0.373</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>2009999</td>\n",
+              "      <td>9999 Wiles (4196 T-2)</td>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>7.148</td>\n",
+              "      <td>0.2620</td>\n",
+              "      <td>0.065</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 13 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-0161aa69-d50f-4d6f-84c1-10dacb278880')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-0161aa69-d50f-4d6f-84c1-10dacb278880');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 5
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can see that our dataset consists of both:\n",
+        "\n",
+        "* **Numerical columns:** These columns need to be transformed through [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) before they can be used for training a machine learning model.\n",
+        "\n",
+        "* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) to use them during training. \n"
+      ],
+      "metadata": {
+        "id": "8jV9odKhNyF2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D9uJtHLSSAMC"
+      },
+      "source": [
+        "Before executing any transformations, we need to check whether all the columns can be used for model training. Let's first look at the column descriptions as provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n",
+        "\n",
+        "* **spk_id:** Object primary SPK-ID.\n",
+        "* **full_name:** Asteroid name.\n",
+        "* **near_earth_object:** Near-Earth object flag.\n",
+        "* **absolute_magnitude:** The magnitude the asteroid would have if it were 1 au from both the Sun and the observer, at zero phase angle.\n",
+        "* **diameter:** Object diameter (from equivalent sphere), in km.\n",
+        "* **albedo:** The fraction of incoming solar radiation that the object diffusely reflects, measured on a scale from 0 to 1.\n",
+        "* **diameter_sigma:** 1-sigma uncertainty in the object diameter, in km.\n",
+        "* **eccentricity:** A value between 0 and 1 that describes how much the asteroid's orbit deviates from a circle (0 is a perfect circle).\n",
+        "* **inclination:** Angle of the orbit with respect to the x-y ecliptic plane.\n",
+        "* **moid_ld:** Earth minimum orbit intersection distance, in lunar distance (LD) units.\n",
+        "* **object_class:** The classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.\n",
+        "* **semi_major_axis_au_unit:** Half the length of the long axis of the orbit, in au.\n",
+        "* **hazardous_flag:** Hazardous asteroid flag."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DzYVKbwTp72d"
+      },
+      "source": [
+        "Columns **'spk_id'** and **'full_name'** are unique for each row. These columns can be removed since they are not needed for model training."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piRPwH2aqT06"
+      },
+      "outputs": [],
+      "source": [
+        "beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fRvNyahSuX_y"
+      },
+      "source": [
+        "Let's look at the percentage of missing values in each column."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 353
+        },
+        "id": "A2PLchW8vXvt",
+        "outputId": "c08d7f23-3a48-4282-a252-66f73cc7fd86"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in long_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_02cba1067de8e024374a26297834d233\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_02cba1067de8e024374a26297834d233\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "near_earth_object           0.000000\n",
+              "absolute_magnitude          0.000000\n",
+              "diameter                   13.111311\n",
+              "albedo                     13.271327\n",
+              "diameter_sigma             14.081408\n",
+              "eccentricity                0.000000\n",
+              "inclination                 0.000000\n",
+              "moid_ld                     0.000000\n",
+              "object_class                0.000000\n",
+              "semi_major_axis_au_unit     0.000000\n",
+              "hazardous_flag              0.000000\n",
+              "dtype: float64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ],
+      "source": [
+        "ib.collect(beam_df.isnull().mean() * 100)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "00MRdFGLwQiD"
+      },
+      "source": [
+        "Most columns have no missing values. Columns **'diameter'**, **'albedo'** and **'diameter_sigma'** are each missing between 13% and 14% of their values. Since these values cannot be measured or derived, and the columns are not required for training the machine learning model, we can remove them."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tHYeCHREwvyB",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "outputId": "5b1b2767-6ae9-4920-f96e-fd1f18e697bb"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_0a93a40e87f23e4a235dfd56cd10188b\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     near_earth_object  absolute_magnitude  eccentricity  inclination  \\\n",
+              "0                    N                3.40      0.076009    10.594067   \n",
+              "1                    N                4.20      0.229972    34.832932   \n",
+              "2                    N                5.33      0.256936    12.991043   \n",
+              "3                    N                3.00      0.088721     7.141771   \n",
+              "4                    N                6.90      0.190913     5.367427   \n",
+              "...                ...                 ...           ...          ...   \n",
+              "9994                 N               15.10      0.160610     2.311731   \n",
+              "9995                 N               13.60      0.235174     7.657713   \n",
+              "9996                 N               14.30      0.113059     2.459643   \n",
+              "9997                 N               15.10      0.093852     3.912263   \n",
+              "9998                 N               13.00      0.071351     3.198839   \n",
+              "\n",
+              "         moid_ld object_class  semi_major_axis_au_unit hazardous_flag  \n",
+              "0     620.640533          MBA                 2.769165              N  \n",
+              "1     480.348639          MBA                 2.773841              N  \n",
+              "2     402.514639          MBA                 2.668285              N  \n",
+              "3     443.451432          MBA                 2.361418              N  \n",
+              "4     426.433027          MBA                 2.574037              N  \n",
+              "...          ...          ...                      ...            ...  \n",
+              "9994  388.723233          MBA                 2.390249              N  \n",
+              "9995  444.194746          MBA                 2.796605              N  \n",
+              "9996  495.460110          MBA                 2.545674              N  \n",
+              "9997  373.848377          MBA                 2.160961              N  \n",
+              "9998  632.144398          MBA                 2.839917              N  \n",
+              "\n",
+              "[9999 rows x 8 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>near_earth_object</th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>object_class</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "      <th>hazardous_flag</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.40</td>\n",
+              "      <td>0.076009</td>\n",
+              "      <td>10.594067</td>\n",
+              "      <td>620.640533</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.769165</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>N</td>\n",
+              "      <td>4.20</td>\n",
+              "      <td>0.229972</td>\n",
+              "      <td>34.832932</td>\n",
+              "      <td>480.348639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.773841</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>N</td>\n",
+              "      <td>5.33</td>\n",
+              "      <td>0.256936</td>\n",
+              "      <td>12.991043</td>\n",
+              "      <td>402.514639</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.668285</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>N</td>\n",
+              "      <td>3.00</td>\n",
+              "      <td>0.088721</td>\n",
+              "      <td>7.141771</td>\n",
+              "      <td>443.451432</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.361418</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>N</td>\n",
+              "      <td>6.90</td>\n",
+              "      <td>0.190913</td>\n",
+              "      <td>5.367427</td>\n",
+              "      <td>426.433027</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.574037</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9994</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.160610</td>\n",
+              "      <td>2.311731</td>\n",
+              "      <td>388.723233</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.390249</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9995</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.60</td>\n",
+              "      <td>0.235174</td>\n",
+              "      <td>7.657713</td>\n",
+              "      <td>444.194746</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.796605</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9996</th>\n",
+              "      <td>N</td>\n",
+              "      <td>14.30</td>\n",
+              "      <td>0.113059</td>\n",
+              "      <td>2.459643</td>\n",
+              "      <td>495.460110</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.545674</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9997</th>\n",
+              "      <td>N</td>\n",
+              "      <td>15.10</td>\n",
+              "      <td>0.093852</td>\n",
+              "      <td>3.912263</td>\n",
+              "      <td>373.848377</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.160961</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9998</th>\n",
+              "      <td>N</td>\n",
+              "      <td>13.00</td>\n",
+              "      <td>0.071351</td>\n",
+              "      <td>3.198839</td>\n",
+              "      <td>632.144398</td>\n",
+              "      <td>MBA</td>\n",
+              "      <td>2.839917</td>\n",
+              "      <td>N</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 8 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-f1eb18dc-39ee-4c7c-9c07-b3f2e37fd398');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ],
+      "source": [
+        "beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)\n",
+        "ib.collect(beam_df)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "a3PojL3WBqgE"
+      },
+      "source": [
+        "The numerical columns need to be normalized before they are used to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation. This puts all columns on the same scale so that they are weighted equally during training."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sZ2_gB8wENF1"
+      },
+      "source": [
+        "Let's first get both the numerical columns and the categorical columns."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vsWY8xW5d_Wn"
+      },
+      "outputs": [],
+      "source": [
+        "numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()\n",
+        "categorical_cols = list(set(beam_df.columns) - set(numerical_cols))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 356
+        },
+        "id": "Gjc0UlDD-xUn",
+        "outputId": "cadc4402-7edc-43f6-bc6a-bc4f1fd3314a"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "error",
+          "ename": "NotImplementedError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mNotImplementedError\u001b[0m                       Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/987840581.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Normalizing method_1: Can work but relies on ticket\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumerical_cols\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frame_base.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    426\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    427\u001b[0m     raise NotImplementedError(\n\u001b[0;32m--> 428\u001b[0;31m         \u001b[0;34mf\"{op!r} is not implemented yet. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    429\u001b[0m         \u001b[0;34mf\"If support for {op!r} is important to you, please let the Beam \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    430\u001b[0m         \u001b[0;34m\"community know by writing to user@beam.apache.org \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mNotImplementedError\u001b[0m: 'loc.setitem' is not implemented yet. If support for 'loc.setitem' is important to you, please let the Beam community know by writing to user@beam.apache.org (see https://beam.apache.org/community/contact-us/) or commenting on https://github.com/apache/beam/issues/20318"
+          ]
+        }
+      ],
+      "source": [
+        "# Normalizing method_1: Can work but relies on ticket #22267\n",
+        "beam_df.loc[:, numerical_cols] = (beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean()) / beam_df.loc[:, numerical_cols].std()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "v03ABuXJKEmv"
+      },
+      "source": [
+        "Normalizing the data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 538
+        },
+        "id": "PD_DTxPCP4hs",
+        "outputId": "be40308f-d27c-46fa-e365-43a635addf6b"
+      },
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_d18c3f74760fd465e55dc784f6b3cf87\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "      absolute_magnitude  eccentricity  inclination   moid_ld  \\\n",
+              "306            -1.570727     -0.062543    -0.278518  0.373194   \n",
+              "310            -1.631718     -1.724526    -0.736389  1.087833   \n",
+              "546            -1.753698      1.028793     1.415303 -0.339489   \n",
+              "635            -1.875678      0.244869     0.005905  0.214107   \n",
+              "701            -3.278451     -1.570523     2.006145  1.542754   \n",
+              "...                  ...           ...          ...       ...   \n",
+              "9697            0.807888     -1.151809    -0.082944 -0.129556   \n",
+              "9813            1.722740      0.844551    -0.583247 -1.006447   \n",
+              "9868            0.807888     -0.207399    -0.784665 -0.462136   \n",
+              "9903            0.868878      0.460086     0.092258 -0.107597   \n",
+              "9956            0.746898     -0.234132    -0.161116 -0.601379   \n",
+              "\n",
+              "      semi_major_axis_au_unit  \n",
+              "306                  0.357201  \n",
+              "310                  0.344233  \n",
+              "546                  0.139080  \n",
+              "635                  0.367559  \n",
+              "701                  0.829337  \n",
+              "...                       ...  \n",
+              "9697                -0.533538  \n",
+              "9813                -0.677961  \n",
+              "9868                -0.539794  \n",
+              "9903                 0.071794  \n",
+              "9956                -0.664887  \n",
+              "\n",
+              "[9999 rows x 5 columns]"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>absolute_magnitude</th>\n",
+              "      <th>eccentricity</th>\n",
+              "      <th>inclination</th>\n",
+              "      <th>moid_ld</th>\n",
+              "      <th>semi_major_axis_au_unit</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>306</th>\n",
+              "      <td>-1.570727</td>\n",
+              "      <td>-0.062543</td>\n",
+              "      <td>-0.278518</td>\n",
+              "      <td>0.373194</td>\n",
+              "      <td>0.357201</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>310</th>\n",
+              "      <td>-1.631718</td>\n",
+              "      <td>-1.724526</td>\n",
+              "      <td>-0.736389</td>\n",
+              "      <td>1.087833</td>\n",
+              "      <td>0.344233</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>546</th>\n",
+              "      <td>-1.753698</td>\n",
+              "      <td>1.028793</td>\n",
+              "      <td>1.415303</td>\n",
+              "      <td>-0.339489</td>\n",
+              "      <td>0.139080</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>635</th>\n",
+              "      <td>-1.875678</td>\n",
+              "      <td>0.244869</td>\n",
+              "      <td>0.005905</td>\n",
+              "      <td>0.214107</td>\n",
+              "      <td>0.367559</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>701</th>\n",
+              "      <td>-3.278451</td>\n",
+              "      <td>-1.570523</td>\n",
+              "      <td>2.006145</td>\n",
+              "      <td>1.542754</td>\n",
+              "      <td>0.829337</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>...</th>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "      <td>...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9697</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-1.151809</td>\n",
+              "      <td>-0.082944</td>\n",
+              "      <td>-0.129556</td>\n",
+              "      <td>-0.533538</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9813</th>\n",
+              "      <td>1.722740</td>\n",
+              "      <td>0.844551</td>\n",
+              "      <td>-0.583247</td>\n",
+              "      <td>-1.006447</td>\n",
+              "      <td>-0.677961</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9868</th>\n",
+              "      <td>0.807888</td>\n",
+              "      <td>-0.207399</td>\n",
+              "      <td>-0.784665</td>\n",
+              "      <td>-0.462136</td>\n",
+              "      <td>-0.539794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9903</th>\n",
+              "      <td>0.868878</td>\n",
+              "      <td>0.460086</td>\n",
+              "      <td>0.092258</td>\n",
+              "      <td>-0.107597</td>\n",
+              "      <td>0.071794</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>9956</th>\n",
+              "      <td>0.746898</td>\n",
+              "      <td>-0.234132</td>\n",
+              "      <td>-0.161116</td>\n",
+              "      <td>-0.601379</td>\n",
+              "      <td>-0.664887</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "<p>9999 rows × 5 columns</p>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-5bcfe283-1b7d-4af1-af32-05eea5ddacbc');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ],
+      "source": [
+        "# Get numerical columns\n",
+        "beam_numerical_cols = beam_df.filter(items=numerical_cols)\n",
+        "\n",
+        "# Standardize only the numerical columns\n",
+        "beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean())/beam_numerical_cols.std()\n",
+        "\n",
+        "ib.collect(beam_numerical_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qdNILsajFvex"
+      },
+      "source": [
+        "Next, we need to convert the categorical columns into one-hot encoded variables so that they can be used during training.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "Ngoxg0rSywVd",
+        "outputId": "d81bb29a-f8f8-4186-b186-5ff85667dbec"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1671644751.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mbeam_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'object_class'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'get_dummies'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col= beam_df.filter(items=['object_class'])\n",
+        "object_class_col.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "hz8s7z8caTq-",
+        "outputId": "5c543a6b-0ea1-41f8-afec-9691bbbd1f5b"
+      },
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "AttributeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1927971370.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;31m# df['categories_concat'].str.get_dummies('-')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mobject_class_col\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/dataframe/frames.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m   2482\u001b[0m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2483\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2484\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2485\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2486\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[
 0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mAttributeError\u001b[0m: 'DeferredDataFrame' object has no attribute 'str'"
+          ]
+        }
+      ],
+      "source": [
+        "object_class_col.str.get_dummies()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "rVdSIyCB0spw"
+      },
+      "source": [
+        "# Putting it all together\n",
+        "\n",
+        "Let's now combine all the steps above into a single pipeline and visualize the preprocessed data.\n",
+        "\n",
+        "> ℹ️ Note that the only standard Beam construct used here is the `Pipeline` instance. The remaining preprocessing commands are native pandas methods that the Beam DataFrame API supports."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "ndaSNond0v8Q",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 498
+        },
+        "outputId": "0155d359-45c9-4345-e1b6-b1881408f049"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "/content/beam/sdks/python/apache_beam/dataframe/frame_base.py:145: RuntimeWarning: invalid value encountered in double_scalars\n",
+            "  lambda left, right: getattr(left, op)(right), name=op, args=[other])\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "            <link rel=\"stylesheet\" href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\" integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\" crossorigin=\"anonymous\">\n",
+              "            <div id=\"progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\">\n",
+              "              <div class=\"spinner-border text-info\" role=\"status\"></div>\n",
+              "              <span class=\"text-info\">Processing... collect</span>\n",
+              "            </div>\n",
+              "            "
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/javascript": [
+              "\n",
+              "        if (typeof window.interactive_beam_jquery == 'undefined') {\n",
+              "          var jqueryScript = document.createElement('script');\n",
+              "          jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+              "          jqueryScript.type = 'text/javascript';\n",
+              "          jqueryScript.onload = function() {\n",
+              "            var datatableScript = document.createElement('script');\n",
+              "            datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+              "            datatableScript.type = 'text/javascript';\n",
+              "            datatableScript.onload = function() {\n",
+              "              window.interactive_beam_jquery = jQuery.noConflict(true);\n",
+              "              window.interactive_beam_jquery(document).ready(function($){\n",
+              "                \n",
+              "            $(\"#progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\").remove();\n",
+              "              });\n",
+              "            }\n",
+              "            document.head.appendChild(datatableScript);\n",
+              "          };\n",
+              "          document.head.appendChild(jqueryScript);\n",
+              "        } else {\n",
+              "          window.interactive_beam_jquery(document).ready(function($){\n",
+              "            \n",
+              "            $(\"#progress_indicator_fc5349f6a626d7566f941a2f2a1fccfe\").remove();\n",
+              "          });\n",
+              "        }"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "error",
+          "ename": "ValueError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
+            "\u001b[0;32m/tmp/ipykernel_325/1408061827.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m     25\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     26\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 27\u001b[0;31m \u001b[0mib\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcollect\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpreprocessed_dataset\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/utils.py\u001b[0m in \u001b[0;36mrun_within_progress_indicator\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    275\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mrun_within_progress_indicator\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    276\u001b[0m     \u001b[0;32mwith\u001b[0m \u001b[0mProgressIndicator\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34mf'Processing... {func.__name__}'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Done.'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 277\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001
 b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    278\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    279\u001b[0m   \u001b[0;32mreturn\u001b[0m \u001b[0mrun_within_progress_indicator\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/interactive_beam.py\u001b[0m in \u001b[0;36mcollect\u001b[0;34m(pcoll, n, duration, include_window_info)\u001b[0m\n\u001b[1;32m    945\u001b[0m         element_type=element_type)\n\u001b[1;32m    946\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 947\u001b[0;31m   \u001b[0mrecording\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mrecording_manager\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrecord\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mpcoll\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmax_n\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mn\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmax_duration\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mduration\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    948\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    949\u001b[0m   \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0
 ;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/recording_manager.py\u001b[0m in \u001b[0;36mrecord\u001b[0;34m(self, pcolls, max_n, max_duration)\u001b[0m\n\u001b[1;32m    459\u001b[0m       pf.PipelineFragment(\n\u001b[1;32m    460\u001b[0m           \u001b[0mlist\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0muncomputed_pcolls\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 461\u001b[0;31m           self.user_pipeline.options).run(blocking=is_remote_run)\n\u001b[0m\u001b[1;32m    462\u001b[0m       \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mie\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcurrent_env\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpipeline_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0muser_pipeline\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n
 \u001b[1;32m    463\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py\u001b[0m in \u001b[0;36mrun\u001b[0;34m(self, display_pipeline_graph, use_cache, blocking)\u001b[0m\n\u001b[1;32m    111\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_force_compute\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0muse_cache\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    112\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_blocking\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mblocking\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 113\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdeduce_fragment\u001
 b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    114\u001b[0m     \u001b[0;32mfinally\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    115\u001b[0m       \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_skip_display\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpreserved_skip_display\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py\u001b[0m in \u001b[0;36mdeduce_fragment\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m     98\u001b[0m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mto_runner_api\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     99\u001b[0m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrunner\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 100\u001b[0;31m         self._options)\n\u001b[0m\u001b[1;32m    101\u001b[0m     \u001b[0mie\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcurrent_env\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0madd_derived_pipeline\u001b[0m\u001b[0;34m(\u001b[0m\u001b[
 0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_runner_pipeline\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfragment\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    102\u001b[0m     \u001b[0;32mreturn\u001b[0m \u001b[0mfragment\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;32m/content/beam/sdks/python/apache_beam/pipeline.py\u001b[0m in \u001b[0;36mfrom_runner_api\u001b[0;34m(proto, runner, options, return_context)\u001b[0m\n\u001b[1;32m    990\u001b[0m       \u001b[0mpcollection\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpipeline\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mp\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    991\u001b[0m       \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mpcollection\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mproducer\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 992\u001b[0;31m         \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'No producer for %s'\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0mid\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    993\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    994\u001b[0m 
     \u001b[0;31m# Inject PBegin input where necessary.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+            "\u001b[0;31mValueError\u001b[0m: No producer for ref_PCollection_PCollection_265"
+          ]
+        }
+      ],
+      "source": [
+        "# Initialize the pipeline\n",
+        "p = beam.Pipeline(InteractiveRunner())\n",
+        "\n",
+        "# Create a deferred Beam DataFrame with the contents of our csv file.\n",
+        "beam_df = p | beam.dataframe.io.read_csv('/content/drive/MyDrive/apache beam/dataset/nasa/sample_10000.csv', splittable=True)\n",

Review Comment:
   Done :)
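
For reference, the two `AttributeError` cells above fail because `get_dummies` is not exposed on `DeferredDataFrame`. In plain pandas it is the module-level function `pd.get_dummies`, not a DataFrame method. A likely reason it is hard to defer is that the output columns depend on the values in the data. The sketch below shows, in plain pandas only, how fixing the category list up front makes the output columns known before any data is read. The category values and the helper name `one_hot_fixed` are hypothetical, and whether the Beam DataFrame API accepts the same pattern is an assumption that ticket #22268 would need to confirm.

```python
import pandas as pd

# Hypothetical category values: the real contents of 'object_class'
# are not shown in this thread.
OBJECT_CLASSES = ["AMO", "APO", "ATE"]


def one_hot_fixed(series: pd.Series, categories: list) -> pd.DataFrame:
    """One-hot encode `series` against a fixed, known list of categories.

    Casting to a CategoricalDtype first means the output always has one
    column per declared category, even if a category is absent from the data.
    """
    typed = series.astype(pd.CategoricalDtype(categories=categories))
    return pd.get_dummies(typed, prefix=series.name)


df = pd.DataFrame({"object_class": ["APO", "AMO", "APO"]})
encoded = one_hot_fixed(df["object_class"], OBJECT_CLASSES)
print(encoded.columns.tolist())
```

Values that are not in the declared list become missing under the categorical cast and therefore encode as an all-zero row, so the list should cover every class expected in the dataset.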


