Posted to commits@systemds.apache.org by ja...@apache.org on 2020/08/02 19:39:30 UTC

[systemds] branch master updated: Notebook for SystemDS on colab for developers

This is an automated email from the ASF dual-hosted git repository.

janardhan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/systemds.git


The following commit(s) were added to refs/heads/master by this push:
     new 1957118  Notebook for SystemDS on colab for developers
1957118 is described below

commit 19571185773daae611970f9596bddeb48eac2f63
Author: Janardhan Pulivarthi <j1...@protonmail.com>
AuthorDate: Mon Aug 3 01:04:38 2020 +0530

    Notebook for SystemDS on colab for developers
    
    * Creates a workspace with all the dependencies for project build.
    * Helps prototype the DML code in browser.
    
    Closes #999.
---
 notebooks/systemds_dev.ipynb | 642 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 642 insertions(+)

diff --git a/notebooks/systemds_dev.ipynb b/notebooks/systemds_dev.ipynb
new file mode 100644
index 0000000..dd9d706
--- /dev/null
+++ b/notebooks/systemds_dev.ipynb
@@ -0,0 +1,642 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "name": "SystemDS on Colaboratory.ipynb",
+      "provenance": [],
+      "collapsed_sections": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "XX60cA7YuZsw",
+        "colab_type": "text"
+      },
+      "source": [
+        "##### Copyright &copy; 2020 The Apache Software Foundation."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "8GEGDZ9GuZGp",
+        "colab_type": "code",
+        "cellView": "form",
+        "colab": {}
+      },
+      "source": [
+        "# @title Apache Version 2.0 (The \"License\");\n",
+        "#-------------------------------------------------------------\n",
+        "#\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements.  See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership.  The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License.  You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied.  See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License.\n",
+        "#\n",
+        "#-------------------------------------------------------------"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "_BbCdLjRoy2A",
+        "colab_type": "text"
+      },
+      "source": [
+        "### Developer notebook for Apache SystemDS"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "zhdfvxkEq1BX",
+        "colab_type": "text"
+      },
+      "source": [
+        "Run this notebook online at [Google Colab ↗](https://colab.research.google.com/github/apache/systemds/blob/master/notebooks/systemds_dev.ipynb).\n",
+        "\n",
+        "\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "efFVuggts1hr",
+        "colab_type": "text"
+      },
+      "source": [
+        "This Jupyter/Colab-based tutorial interactively walks through the development setup and running SystemDS in either of two modes:\n",
+        "\n",
+        "A. standalone mode \\\n",
+        "B. with Apache Spark.\n",
+        "\n",
+        "Flow of the notebook:\n",
+        "1. Download and Install the dependencies\n",
+        "2. Go to section **A** or **B**"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "vBC5JPhkGbIV",
+        "colab_type": "text"
+      },
+      "source": [
+        "#### Download and Install the dependencies\n",
+        "\n",
+        "1. **Runtime:** Java (OpenJDK 8 is preferred)\n",
+        "2. **Build:** Apache Maven\n",
+        "3. **Backend:** Apache Spark (optional)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "VkLasseNylPO",
+        "colab_type": "text"
+      },
+      "source": [
+        "##### Setup\n",
+        "\n",
+        "A custom function to run OS commands."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "4Wmf-7jfydVH",
+        "colab_type": "code",
+        "colab": {}
+      },
+      "source": [
+        "# Run and print a shell command.\n",
+        "def run(command):\n",
+        "  print('>> {}'.format(command))\n",
+        "  !{command}\n",
+        "  print('')"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "kvD4HBMi0ohY",
+        "colab_type": "text"
+      },
+      "source": [
+        "##### Install Java\n",
+        "Let us install OpenJDK 8. More about [OpenJDK ↗](https://openjdk.java.net/install/)."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "8Xnb_ePUyQIL",
+        "colab_type": "code",
+        "colab": {}
+      },
+      "source": [
+        "!apt-get install openjdk-8-jdk-headless -qq > /dev/null\n",
+        "\n",
+        "# Run the command below to make OpenJDK 8 the default java installation\n",
+        "!update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java\n",
+        "\n",
+        "import os\n",
+        "os.environ[\"JAVA_HOME\"] = \"/usr/lib/jvm/java-8-openjdk-amd64\"\n",
+        "\n",
+        "!java -version"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "BhmBWf3u3Q0o",
+        "colab_type": "text"
+      },
+      "source": [
+        "##### Install Apache Maven\n",
+        "\n",
+        "SystemDS uses Apache Maven to build and manage the project. More about [Apache Maven ↗](http://maven.apache.org/).\n",
+        "\n",
+        "Maven builds SystemDS using its project object model (POM) and a set of plugins. You will find `pom.xml` in the codebase!"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "I81zPDcblchL",
+        "colab_type": "code",
+        "colab": {}
+      },
+      "source": [
+        "# Download the Maven binary distribution.\n",
+        "maven_version = 'apache-maven-3.6.3'\n",
+        "maven_path = f\"/opt/{maven_version}\"\n",
+        "\n",
+        "if not os.path.exists(maven_path):\n",
+        "  run(f\"wget -q -nc -O apache-maven.zip https://downloads.apache.org/maven/maven-3/3.6.3/binaries/{maven_version}-bin.zip\")\n",
+        "  run('unzip -q -d /opt apache-maven.zip')\n",
+        "  run('rm -f apache-maven.zip')\n",
+        "\n",
+        "# Call mvn by its absolute path instead of relying on the $PATH environment variable.\n",
+        "def maven(args):\n",
+        "  run(f\"{maven_path}/bin/mvn {args}\")\n",
+        "\n",
+        "maven('-v')"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "Xphbe3R43XLw",
+        "colab_type": "text"
+      },
+      "source": [
+        "##### Install Apache Spark (optional, if you want to work with the Spark backend)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "_WgEa00pTs3w",
+        "colab_type": "text"
+      },
+      "source": [
+        "NOTE: If Spark fails to download, make sure the version we are trying to download is officially supported at\n",
+        "https://spark.apache.org/downloads.html"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "3zdtkFkLnskx",
+        "colab_type": "code",
+        "colab": {}
+      },
+      "source": [
+        "# Spark and Hadoop version\n",
+        "spark_version = 'spark-2.4.6'\n",
+        "hadoop_version = 'hadoop2.7'\n",
+        "spark_path = f\"/opt/{spark_version}-bin-{hadoop_version}\"\n",
+        "if not os.path.exists(spark_path):\n",
+        "  run(f\"wget -q -nc -O apache-spark.tgz https://downloads.apache.org/spark/{spark_version}/{spark_version}-bin-{hadoop_version}.tgz\")\n",
+        "  run('tar zxf apache-spark.tgz -C /opt')\n",
+        "  run('rm -f apache-spark.tgz')\n",
+        "\n",
+        "os.environ[\"SPARK_HOME\"] = spark_path\n",
+        "os.environ[\"PATH\"] += f\":{spark_path}/bin\"\n"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "91pJ5U8k3cjk",
+        "colab_type": "text"
+      },
+      "source": [
+        "#### Get Apache SystemDS\n",
+        "\n",
+        "Apache SystemDS development happens on GitHub at [apache/systemds ↗](https://github.com/apache/systemds)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "SaPIprmg3lKE",
+        "colab_type": "code",
+        "colab": {}
+      },
+      "source": [
+        "!git clone https://github.com/apache/systemds systemds --depth=1\n",
+        "%cd systemds"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "40Fo9tPUzbWK",
+        "colab_type": "text"
+      },
+      "source": [
+        "##### Build the project"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "s0Iorb0ICgHa",
+        "colab_type": "code",
+        "colab": {}
+      },
+      "source": [
+        "# Logging flags: -q shows only errors; -X enables debug output; -e prints full error messages\n",
+        "# Option 1: Build only the java codebase\n",
+        "maven('clean package -q')\n",
+        "\n",
+        "# Option 2: For building along with python distribution\n",
+        "# maven('clean package -P distribution')"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "SUGac5w9ZRBQ",
+        "colab_type": "text"
+      },
+      "source": [
+        "### A. Working with SystemDS in **standalone** mode\n",
+        "\n",
+        "NOTE: Let's pay attention to *directories* and *relative paths*. :)\n",
+        "\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "g5Nk2Bb4UU2O",
+        "colab_type": "text"
+      },
+      "source": [
+        "##### 1. Set SystemDS environment variables\n",
+        "\n",
+        "These are useful for the `./bin/systemds` script."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "2ZnSzkq8UT32",
+        "colab_type": "code",
+        "colab": {}
+      },
+      "source": [
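+        "os.environ[\"SYSTEMDS_ROOT\"] = os.getcwd()  # `!export` would not persist beyond its subshell\n",
+        "os.environ[\"PATH\"] = os.environ[\"SYSTEMDS_ROOT\"] + \"/bin:\" + os.environ[\"PATH\"]"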
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "zyLmFCv6ZYk5",
+        "colab_type": "text"
+      },
+      "source": [
+        "##### 2. Download Haberman data\n",
+        "\n",
+        "Data source: https://archive.ics.uci.edu/ml/datasets/Haberman's+Survival\n",
+        "\n",
+        "About: The survival of patients who had undergone surgery for breast cancer.\n",
+        "\n",
+        "Data Attributes:\n",
+        "1. Age of patient at time of operation (numerical)\n",
+        "2. Patient's year of operation (year - 1900, numerical)\n",
+        "3. Number of positive axillary nodes detected (numerical)\n",
+        "4. Survival status (class attribute)\n",
+        "    - 1 = the patient survived 5 years or longer\n",
+        "    - 2 = the patient died within 5 years"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "ZrQFBQehV8SF",
+        "colab_type": "code",
+        "colab": {}
+      },
+      "source": [
+        "!mkdir -p ../data"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "E1ZFCTFmXFY_",
+        "colab_type": "code",
+        "colab": {}
+      },
+      "source": [
+        "!wget -P ../data/ http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "FTo8Py_vOGpX",
+        "colab_type": "code",
+        "colab": {}
+      },
+      "source": [
+        "# Display first 10 lines of the dataset\n",
+        "# Notice that the text is plain csv with no headers!\n",
+        "!sed -n 1,10p ../data/haberman.data"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "Oy2kgVdkaeWK",
+        "colab_type": "text"
+      },
+      "source": [
+        "##### 2.1 Set `metadata` for the data\n",
+        "\n",
+        "The data does not carry any information about its size or value types. So, a `metadata` file\n",
+        "tells SystemDS the size and format of the matrix data. It is a `.mtd` file with the same\n",
+        "name and location as the `.data` file."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "vfypIgJWXT6K",
+        "colab_type": "code",
+        "colab": {}
+      },
+      "source": [
+        "# generate metadata file for the dataset\n",
+        "!echo '{\"rows\": 306, \"cols\": 4, \"format\": \"csv\"}' > ../data/haberman.data.mtd\n",
+        "\n",
+        "# generate the type description for the data (1 = scale, 2 = nominal)\n",
+        "!echo '1,1,1,2' > ../data/types.csv\n",
+        "!echo '{\"rows\": 1, \"cols\": 4, \"format\": \"csv\"}' > ../data/types.csv.mtd"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "7Vis3V31bA53",
+        "colab_type": "text"
+      },
+      "source": [
+        "##### 3. Find the algorithm to run with `systemds`"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "L_0KosFhbhun",
+        "colab_type": "code",
+        "colab": {}
+      },
+      "source": [
+        "# Inspect the directory structure of systemds code base\n",
+        "!ls"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "R7C5DVM7YfTb",
+        "colab_type": "code",
+        "colab": {}
+      },
+      "source": [
+        "# List all the scripts (also called top level algorithms!)\n",
+        "!ls scripts/algorithms"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "5PrxwviWJhNd",
+        "colab_type": "code",
+        "colab": {}
+      },
+      "source": [
+        "# Let's choose the univariate statistics script.\n",
+        "# Print the algorithm documentation, which is found\n",
+        "# on lines 22 to 35 of the script.\n",
+        "!sed -n 22,35p ./scripts/algorithms/Univar-Stats.dml"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "zv_7wRPFSeuJ",
+        "colab_type": "code",
+        "colab": {}
+      },
+      "source": [
+        "!./bin/systemds ./scripts/algorithms/Univar-Stats.dml -nvargs X=../data/haberman.data TYPES=../data/types.csv STATS=../data/univarOut.mtx CONSOLE_OUTPUT=TRUE"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "IqY_ARNnavrC",
+        "colab_type": "text"
+      },
+      "source": [
+        "##### 3.1 Let us inspect the output data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "k-_eQg9TauPi",
+        "colab_type": "code",
+        "colab": {}
+      },
+      "source": [
+        "# output first 10 lines only.\n",
+        "!sed -n 1,10p ../data/univarOut.mtx"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "o5VCCweiDMjf",
+        "colab_type": "text"
+      },
+      "source": [
+        "### B. Run SystemDS with Apache Spark"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "6gJhL7lc1vf7",
+        "colab_type": "text"
+      },
+      "source": [
+        "#### Playground for DML scripts\n",
+        "\n",
+        "DML - A custom language designed for SystemDS with R-like syntax."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "zzqeSor__U6M",
+        "colab_type": "text"
+      },
+      "source": [
+        "##### A test `dml` script to prototype algorithms\n",
+        "\n",
+        "Modify the code in the cell below and run it to develop data science tasks\n",
+        "in a high-level language."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "t59rTyNbOF5b",
+        "colab_type": "code",
+        "colab": {}
+      },
+      "source": [
+        "%%writefile ../test.dml\n",
+        "\n",
+        "# This code acts as a playground for dml code\n",
+        "X = rand (rows = 20, cols = 10)\n",
+        "y = X %*% rand(rows = ncol(X), cols = 1)\n",
+        "lm(X = X, y = y)"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "VDfeuJYE1JfK",
+        "colab_type": "text"
+      },
+      "source": [
+        "Submit the `dml` script to Spark with `spark-submit`.\n",
+        "More about [Spark Submit ↗](https://spark.apache.org/docs/latest/submitting-applications.html)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "YokktyNE1Cig",
+        "colab_type": "code",
+        "colab": {}
+      },
+      "source": [
+        "!$SPARK_HOME/bin/spark-submit \\\n",
+        "    ./target/SystemDS.jar -f ../test.dml"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "gCMkudo_-8_8",
+        "colab_type": "text"
+      },
+      "source": [
+        "##### Run a binary classification example with sample data\n",
+        "\n",
+        "Notice that this example is written entirely in plain dml; no other scripts are used."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "OSLq2cZb_SUl",
+        "colab_type": "code",
+        "colab": {}
+      },
+      "source": [
+        "# Example binary classification task with sample data.\n",
+        "# !$SPARK_HOME/bin/spark-submit ./target/SystemDS.jar -f ./scripts/nn/examples/fm-binclass-dummy-data.dml"
+      ],
+      "execution_count": null,
+      "outputs": []
+    }
+  ]
+}
\ No newline at end of file