Posted to github@beam.apache.org by "damccorm (via GitHub)" <gi...@apache.org> on 2023/04/03 14:19:31 UTC

[GitHub] [beam] damccorm commented on a diff in pull request #25904: Add XGBoost example notebook

damccorm commented on code in PR #25904:
URL: https://github.com/apache/beam/pull/25904#discussion_r1156030359


##########
examples/notebooks/beam-ml/run_inference_xgboost.ipynb:
##########
@@ -0,0 +1,374 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "cellView": "form",
+        "id": "XobBB6Sv8mB3"
+      },
+      "outputs": [],
+      "source": [
+        "# @title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Apache Beam RunInference for XGBoost\n",
+        "\n",
+        "<table align=\"left\">\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_xgboost.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\" />Run in Google Colab</a>\n",
+        "  </td>\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_xgboost.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\" />View source on GitHub</a>\n",
+        "  </td>\n",
+        "</table>\n"
+      ],
+      "metadata": {
+        "id": "DUGbrRuv89CS"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "This notebook demonstrates the use of the RunInference transform for XGBoost. Apache Beam RunInference has implementations of the ModelHandler class prebuilt for XGBoost. For more information about the RunInference API, see the [Machine Learning section of the Apache Beam documentation](https://beam.apache.org/documentation/ml/overview/).\n",
+        "\n",
+        "You can choose the appropriate model handler based on your input data type:\n",
+        "\n",
+        "- NumPy model handler\n",
+        "- Pandas DataFrame model handler\n",
+        "- Datatable model handler\n",
+        "- SciPy model handler\n",
+        "\n",
+        "With RunInference, these model handlers manage batching, vectorization, and prediction optimization for your XGBoost pipeline or model.\n",
+        "\n",
+        "This notebook demonstrates the following common RunInference patterns:\n",
+        "\n",
+        "- Generate predictions\n",
+        "- Postprocess results after RunInference\n",
+        "- One model to showcase classification of Iris flowers\n",
+        "- One regression model to showcase prediction of housing prices"
+      ],
+      "metadata": {
+        "id": "6nh2h-sIOAOg"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Before you begin\n",
+        "Complete the following setup steps:\n",
+        "- Install dependencies for Apache Beam."
+      ],
+      "metadata": {
+        "id": "nRCJBcTUOq1k"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# !pip install apache-beam[gcp,dataframe] --quiet\n",
+        "!pip install git+https://github.com/apache/beam.git"
+      ],
+      "metadata": {
+        "id": "gbmH329jOuj1"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import numpy as np\n",
+        "import xgboost\n",
+        "import apache_beam as beam\n",
+        "from sklearn.datasets import fetch_california_housing\n",
+        "from sklearn.datasets import load_iris\n",
+        "from sklearn.model_selection import train_test_split\n",
+        "\n",
+        "from apache_beam.ml.inference import RunInference\n",
+        "from apache_beam.ml.inference.base import PredictionResult\n",
+        "from apache_beam.ml.inference.xgboost_inference import XGBoostModelHandlerNumpy\n",
+        "from apache_beam.options.pipeline_options import PipelineOptions"
+      ],
+      "metadata": {
+        "id": "_O0BN_XqOwp1"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "SEED = 999\n",
+        "CLASSIFICATION_MODEL_STATE = '/tmp/classification_model.json'\n",
+        "REGRESSION_MODEL_STATE = '/tmp/regression_model.json'"
+      ],
+      "metadata": {
+        "id": "ue_5a-oaO-Lz"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Load the data from scikit-learn and train XGBoost models\n",
+        "This section demonstrates the following steps:\n",
+        "1. Load the Iris and California Housing datasets from scikit-learn and create a classification and regression model.\n",
+        "2. Train the classification and regression models.\n",
+        "3. Save each model to a JSON file using `model.save_model`. (https://xgboost.readthedocs.io/en/stable/tutorials/saving_model.html)\n",
+        "\n",
+        "In this example, you create two models, one to classify Iris flowers and one to predict housing prices in California."
+      ],
+      "metadata": {
+        "id": "74oE5pGgPE0M"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Train the classification model\n",
+        "iris_dataset = load_iris()\n",
+        "x_train_classification, x_test_classification, y_train_classification, y_test_classification = train_test_split(\n",
+        "    iris_dataset['data'], iris_dataset['target'], test_size=.2, random_state=SEED)\n",
+        "booster = xgboost.XGBClassifier(\n",
+        "    n_estimators=2, max_depth=2, learning_rate=1, objective='binary:logistic')\n",
+        "booster.fit(x_train_classification, y_train_classification)\n",
+        "booster.save_model(CLASSIFICATION_MODEL_STATE)\n",
+        "\n",
+        "# Train the regression model\n",
+        "california_dataset = fetch_california_housing()\n",
+        "x_train_regression, x_test_regression, y_train_regression, y_test_regression = train_test_split(\n",
+        "    california_dataset['data'], california_dataset['target'], test_size=.2, random_state=SEED)\n",
+        "model = xgboost.XGBRegressor(\n",
+        "    n_estimators=1000,\n",
+        "    max_depth=8,\n",
+        "    eta=0.1,\n",
+        "    subsample=0.75,\n",
+        "    colsample_bytree=0.8)\n",
+        "model.fit(x_train_regression, y_train_regression)\n",
+        "model.save_model(REGRESSION_MODEL_STATE)\n",
+        "\n",
+        "\n",
+        "# Reshape the test data as XGBoost expects a batch instead of a single element\n",
+        "# More information: https://xgboost.readthedocs.io/en/stable/prediction.html\n",
+        "x_test_classification = x_test_classification.reshape(5, 6, 4)\n",
+        "x_test_regression = x_test_regression.reshape(258, 16, 8)"
+      ],
+      "metadata": {
+        "id": "KVSKt3pFPBnj"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "### Postprocessing helper functions"
+      ],
+      "metadata": {
+        "id": "VGQj-B1Abioq"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "def translate_labels(inference_results: PredictionResult):\n",
+        "  \"\"\"\n",
+        "    Maps output values (0, 1 or 2) of the XGBoost Iris classification\n",
+        "    model to the names of the different Iris flowers.\n",
+        "    Args:\n",
+        "      inference_results: Array containing the outputs of the XGBoost Iris classification model\n",
+        "    \"\"\"\n",
+        "  return PredictionResult(\n",
+        "      inference_results.example,\n",
+        "      np.vectorize(['Setosa', 'Versicolour',\n",
+        "                    'Virginica'].__getitem__)(inference_results.inference))\n",
+        "\n",
+        "\n",
+        "class FlattenBatchPredictionResults(beam.DoFn):\n",
+        "  \"\"\"This function takes a batch (list) of\n",
+        "  PredictionResults as input and yield all elements\"\"\"\n",

Review Comment:
   ```suggestion
           "  \"\"\"This function takes a PredictionResult containing a batch (list) of\n",
           "  examples and predictions as input and yields all example/prediction pairs\"\"\"\n",
   ```
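
A minimal sketch of the behavior the suggested docstring describes, written without Beam so it runs standalone. `BatchResult` and `flatten_batch` are hypothetical names: `BatchResult` stands in for Beam's `PredictionResult` and is assumed to expose the `example`, `inference`, and `model_id` fields that the notebook code uses. The sample values are made up for illustration.

```python
from typing import Any, Iterator, NamedTuple, Optional, Sequence


class BatchResult(NamedTuple):
    """Hypothetical stand-in for apache_beam's PredictionResult."""
    example: Sequence[Any]
    inference: Sequence[Any]
    model_id: Optional[str] = None


def flatten_batch(result: BatchResult) -> Iterator[BatchResult]:
    """Yield one result per example/prediction pair in a batched result."""
    for example, inference in zip(result.example, result.inference):
        yield BatchResult(example, inference, result.model_id)


# Illustrative values only: two Iris-like rows and their predicted classes.
batch = BatchResult(
    example=[[5.1, 3.5, 1.4, 0.2], [6.7, 3.0, 5.2, 2.3]],
    inference=[0, 2],
    model_id="demo",
)
pairs = list(flatten_batch(batch))
```

The input is a single result object holding parallel sequences, not a list of result objects, which is the distinction the reworded docstring makes.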



##########
examples/notebooks/beam-ml/run_inference_xgboost.ipynb:
##########
@@ -0,0 +1,374 @@
+        "class FlattenBatchPredictionResults(beam.DoFn):\n",
+        "  \"\"\"This function takes a batch (list) of\n",
+        "  PredictionResults as input and yield all elements\"\"\"\n",
+        "  def process(self, batch_prediction_result: PredictionResult):\n",
+        "    for example, inference in zip(batch_prediction_result.example, batch_prediction_result.inference):\n",
+        "      yield PredictionResult(\n",
+        "          example, inference, batch_prediction_result.model_id)\n"

Review Comment:
   Instead of yielding the result, how about we just print it here (rather than having an extra map to do that everywhere)?
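
One way this could look, sketched without Beam so it runs standalone. `BatchResult`, `format_pairs`, and `print_batch` are hypothetical names and the sample values are made up; in the notebook, the loop would sit inside the `process` method of the `DoFn`.

```python
from typing import Any, List, NamedTuple, Sequence


class BatchResult(NamedTuple):
    """Hypothetical stand-in for apache_beam's PredictionResult."""
    example: Sequence[Any]
    inference: Sequence[Any]


def format_pairs(result: BatchResult) -> List[str]:
    """Build one printable line per example/prediction pair."""
    return [
        f"example={example} prediction={inference}"
        for example, inference in zip(result.example, result.inference)
    ]


def print_batch(result: BatchResult) -> None:
    """Print each pair directly, so no separate print step is needed."""
    for line in format_pairs(result):
        print(line)


# Illustrative values only.
batch = BatchResult(example=[[5.1, 3.5], [6.7, 3.0]], inference=[0, 2])
lines = format_pairs(batch)
print_batch(batch)
```

The trade-off is that a step that only prints emits no elements, so nothing downstream can consume the flattened results. That is likely acceptable for a demonstration notebook but would not suit a pipeline that writes predictions elsewhere.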



##########
examples/notebooks/beam-ml/run_inference_xgboost.ipynb:
##########
@@ -0,0 +1,374 @@
+    {
+      "cell_type": "code",
+      "source": [
+        "# !pip install apache-beam[gcp,dataframe] --quiet\n",
+        "!pip install git+https://github.com/apache/beam.git"

Review Comment:
   ```suggestion
           "!pip install git+https://github.com/apache/beam.git"
   ```
   
   The commented-out line here doesn't add anything for a user. I filed https://github.com/apache/beam/issues/26077 to update this cell once 2.47 is released.
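
As a rough illustration of when the source install could be dropped, the sketch below compares a version string against a threshold. It assumes, based only on the remark above, that the XGBoost model handler first ships in release 2.47.0; `parse_version` and `needs_source_install` are hypothetical helpers, not part of Beam.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse the leading numeric components of a string like '2.47.0'."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # Stop at the first non-numeric component, e.g. 'dev'.
        parts.append(int(digits))
    return tuple(parts)


def needs_source_install(installed: str, first_release: str = "2.47.0") -> bool:
    """Return True if the installed version predates the assumed first release."""
    return parse_version(installed) < parse_version(first_release)


older = needs_source_install("2.46.0")
current = needs_source_install("2.47.0")
```

In a notebook, the installed version string could come from `importlib.metadata.version("apache-beam")`. Pre-release suffixes are ignored by this simple parser, so it is only a coarse check.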



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
