Posted to commits@systemds.apache.org by ja...@apache.org on 2020/08/02 19:39:30 UTC
[systemds] branch master updated: Notebook for SystemDS on colab for developers
This is an automated email from the ASF dual-hosted git repository.
janardhan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/systemds.git
The following commit(s) were added to refs/heads/master by this push:
new 1957118 Notebook for SystemDS on colab for developers
1957118 is described below
commit 19571185773daae611970f9596bddeb48eac2f63
Author: Janardhan Pulivarthi <j1...@protonmail.com>
AuthorDate: Mon Aug 3 01:04:38 2020 +0530
Notebook for SystemDS on colab for developers
* Creates a workspace with all the dependencies for project build.
* Helps prototype the DML code in browser.
Closes #999.
---
notebooks/systemds_dev.ipynb | 642 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 642 insertions(+)
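
Note on the metadata step in the notebook: the `.mtd` file for the Haberman data is written by hand with `echo`, with the row and column counts hardcoded. A minimal sketch, assuming a header-less rectangular CSV, of deriving those counts instead. `write_mtd` is a hypothetical helper and not part of SystemDS; it emits only the three keys (`rows`, `cols`, `format`) that the notebook itself uses.

```python
import csv
import json


def write_mtd(data_path, fmt="csv"):
    """Write <data_path>.mtd with the rows/cols/format keys used in the notebook.

    Hypothetical helper: assumes a header-less, rectangular CSV file.
    """
    with open(data_path, newline="") as f:
        rows = [row for row in csv.reader(f) if row]  # skip blank lines
    if not rows:
        raise ValueError("empty data file")
    n_cols = len(rows[0])
    if any(len(r) != n_cols for r in rows):
        raise ValueError("rows have differing column counts")
    meta = {"rows": len(rows), "cols": n_cols, "format": fmt}
    with open(data_path + ".mtd", "w") as f:
        json.dump(meta, f)
    return meta
```

Called as `write_mtd("../data/haberman.data")`, this would write `../data/haberman.data.mtd` next to the data file, which is where the notebook places it.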
diff --git a/notebooks/systemds_dev.ipynb b/notebooks/systemds_dev.ipynb
new file mode 100644
index 0000000..dd9d706
--- /dev/null
+++ b/notebooks/systemds_dev.ipynb
@@ -0,0 +1,642 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "name": "SystemDS on Colaboratory.ipynb",
+ "provenance": [],
+ "collapsed_sections": []
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "XX60cA7YuZsw",
+ "colab_type": "text"
+ },
+ "source": [
+ "##### Copyright © 2020 The Apache Software Foundation."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "8GEGDZ9GuZGp",
+ "colab_type": "code",
+ "cellView": "form",
+ "colab": {}
+ },
+ "source": [
+ "# @title Apache Version 2.0 (The \"License\");\n",
+ "#-------------------------------------------------------------\n",
+ "#\n",
+ "# Licensed to the Apache Software Foundation (ASF) under one\n",
+ "# or more contributor license agreements. See the NOTICE file\n",
+ "# distributed with this work for additional information\n",
+ "# regarding copyright ownership. The ASF licenses this file\n",
+ "# to you under the Apache License, Version 2.0 (the\n",
+ "# \"License\"); you may not use this file except in compliance\n",
+ "# with the License. You may obtain a copy of the License at\n",
+ "#\n",
+ "# http://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing,\n",
+ "# software distributed under the License is distributed on an\n",
+ "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ "# KIND, either express or implied. See the License for the\n",
+ "# specific language governing permissions and limitations\n",
+ "# under the License.\n",
+ "#\n",
+ "#-------------------------------------------------------------"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "_BbCdLjRoy2A",
+ "colab_type": "text"
+ },
+ "source": [
+ "### Developer notebook for Apache SystemDS"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "zhdfvxkEq1BX",
+ "colab_type": "text"
+ },
+ "source": [
+ "Run this notebook online at [Google Colab ↗](https://colab.research.google.com/github/apache/systemds/blob/master/notebooks/systemds_dev.ipynb).\n",
+ "\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "efFVuggts1hr",
+ "colab_type": "text"
+ },
+ "source": [
+ "This Jupyter/Colab-based tutorial will interactively walk through development setup and running SystemDS in both the\n",
+ "\n",
+ "A. standalone mode \\\n",
+ "B. with Apache Spark.\n",
+ "\n",
+ "Flow of the notebook:\n",
+ "1. Download and Install the dependencies\n",
+ "2. Go to section **A** or **B**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "vBC5JPhkGbIV",
+ "colab_type": "text"
+ },
+ "source": [
+ "#### Download and Install the dependencies\n",
+ "\n",
+ "1. **Runtime:** Java (OpenJDK 8 is preferred)\n",
+ "2. **Build:** Apache Maven\n",
+ "3. **Backend:** Apache Spark (optional)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "VkLasseNylPO",
+ "colab_type": "text"
+ },
+ "source": [
+ "##### Setup\n",
+ "\n",
+ "A custom function to run OS commands."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "4Wmf-7jfydVH",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "source": [
+ "# Run and print a shell command.\n",
+ "def run(command):\n",
+ " print('>> {}'.format(command))\n",
+ " !{command}\n",
+ " print('')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "kvD4HBMi0ohY",
+ "colab_type": "text"
+ },
+ "source": [
+ "##### Install Java\n",
+ "Let us install OpenJDK 8. More about [OpenJDK ↗](https://openjdk.java.net/install/)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "8Xnb_ePUyQIL",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "source": [
+ "!apt-get install openjdk-8-jdk-headless -qq > /dev/null\n",
+ "\n",
+ "# run the below command to replace the existing installation\n",
+ "!update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java\n",
+ "\n",
+ "import os\n",
+ "os.environ[\"JAVA_HOME\"] = \"/usr/lib/jvm/java-8-openjdk-amd64\"\n",
+ "\n",
+ "!java -version"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "BhmBWf3u3Q0o",
+ "colab_type": "text"
+ },
+ "source": [
+ "##### Install Apache Maven\n",
+ "\n",
+ "SystemDS uses Apache Maven to build and manage the project. More about [Apache Maven ↗](http://maven.apache.org/).\n",
+ "\n",
+ "Maven builds SystemDS using its project object model (POM) and a set of plugins. You will find `pom.xml` in the codebase!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "I81zPDcblchL",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "source": [
+ "# Download the Maven binary distribution.\n",
+ "maven_version = 'apache-maven-3.6.3'\n",
+ "maven_path = f\"/opt/{maven_version}\"\n",
+ "\n",
+ "if not os.path.exists(maven_path):\n",
+ " run(f\"wget -q -nc -O apache-maven.zip https://downloads.apache.org/maven/maven-3/3.6.3/binaries/{maven_version}-bin.zip\")\n",
+ " run('unzip -q -d /opt apache-maven.zip')\n",
+ " run('rm -f apache-maven.zip')\n",
+ "\n",
+ "# Let's choose the absolute path instead of $PATH environment variable.\n",
+ "def maven(args):\n",
+ " run(f\"{maven_path}/bin/mvn {args}\")\n",
+ "\n",
+ "maven('-v')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Xphbe3R43XLw",
+ "colab_type": "text"
+ },
+ "source": [
+ "##### Install Apache Spark (optional, only needed for the Spark backend)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "_WgEa00pTs3w",
+ "colab_type": "text"
+ },
+ "source": [
+ "NOTE: If the Spark download fails, make sure the version we are trying to download is officially supported at\n",
+ "https://spark.apache.org/downloads.html"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "3zdtkFkLnskx",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "source": [
+ "# Spark and Hadoop version\n",
+ "spark_version = 'spark-2.4.6'\n",
+ "hadoop_version = 'hadoop2.7'\n",
+ "spark_path = f\"/opt/{spark_version}-bin-{hadoop_version}\"\n",
+ "if not os.path.exists(spark_path):\n",
+ " run(f\"wget -q -nc -O apache-spark.tgz https://downloads.apache.org/spark/{spark_version}/{spark_version}-bin-{hadoop_version}.tgz\")\n",
+ " run('tar zxf apache-spark.tgz -C /opt')\n",
+ " run('rm -f apache-spark.tgz')\n",
+ "\n",
+ "os.environ[\"SPARK_HOME\"] = spark_path\n",
+ "os.environ[\"PATH\"] += f\":{spark_path}/bin\"\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "91pJ5U8k3cjk",
+ "colab_type": "text"
+ },
+ "source": [
+ "#### Get Apache SystemDS\n",
+ "\n",
+ "Apache SystemDS development happens on GitHub at [apache/systemds ↗](https://github.com/apache/systemds)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "SaPIprmg3lKE",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "source": [
+ "!git clone https://github.com/apache/systemds systemds --depth=1\n",
+ "%cd systemds"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "40Fo9tPUzbWK",
+ "colab_type": "text"
+ },
+ "source": [
+ "##### Build the project"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "s0Iorb0ICgHa",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "source": [
+ "# Logging flags: -q shows only errors; -X enables debug output; -e prints error stack traces\n",
+ "# Option 1: Build only the java codebase\n",
+ "maven('clean package -q')\n",
+ "\n",
+ "# Option 2: For building along with python distribution\n",
+ "# maven('clean package -P distribution')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "SUGac5w9ZRBQ",
+ "colab_type": "text"
+ },
+ "source": [
+ "### A. Working with SystemDS in **standalone** mode\n",
+ "\n",
+ "NOTE: Let's pay attention to *directories* and *relative paths*. :)\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "g5Nk2Bb4UU2O",
+ "colab_type": "text"
+ },
+ "source": [
+ "##### 1. Set SystemDS environment variables\n",
+ "\n",
+ "These are useful for the `./bin/systemds` script."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "2ZnSzkq8UT32",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "source": [
+ "os.environ[\"SYSTEMDS_ROOT\"] = os.getcwd()\n",
+ "os.environ[\"PATH\"] = os.environ[\"SYSTEMDS_ROOT\"] + \"/bin:\" + os.environ[\"PATH\"]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "zyLmFCv6ZYk5",
+ "colab_type": "text"
+ },
+ "source": [
+ "##### 2. Download Haberman data\n",
+ "\n",
+ "Data source: https://archive.ics.uci.edu/ml/datasets/Haberman's+Survival\n",
+ "\n",
+ "About: The survival of patients who had undergone surgery for breast cancer.\n",
+ "\n",
+ "Data Attributes:\n",
+ "1. Age of patient at time of operation (numerical)\n",
+ "2. Patient's year of operation (year - 1900, numerical)\n",
+ "3. Number of positive axillary nodes detected (numerical)\n",
+ "4. Survival status (class attribute)\n",
+ " - 1 = the patient survived 5 years or longer\n",
+ " - 2 = the patient died within 5 years"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ZrQFBQehV8SF",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "source": [
+ "!mkdir ../data"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "E1ZFCTFmXFY_",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "source": [
+ "!wget -P ../data/ http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "FTo8Py_vOGpX",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "source": [
+ "# Display first 10 lines of the dataset\n",
+ "# Notice that the text is plain CSV with no headers!\n",
+ "!sed -n 1,10p ../data/haberman.data"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Oy2kgVdkaeWK",
+ "colab_type": "text"
+ },
+ "source": [
+ "##### 2.1 Set `metadata` for the data\n",
+ "\n",
+ "The data file does not describe its own dimensions, format, or value types. So, a\n",
+ "`metadata` file (`.mtd`) with the same name and location as the `.data` file\n",
+ "tells SystemDS the size and format of the matrix data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "vfypIgJWXT6K",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "source": [
+ "# generate metadata file for the dataset\n",
+ "!echo '{\"rows\": 306, \"cols\": 4, \"format\": \"csv\"}' > ../data/haberman.data.mtd\n",
+ "\n",
+ "# generate type description for the data\n",
+ "!echo '1,1,1,2' > ../data/types.csv\n",
+ "!echo '{\"rows\": 1, \"cols\": 4, \"format\": \"csv\"}' > ../data/types.csv.mtd"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "7Vis3V31bA53",
+ "colab_type": "text"
+ },
+ "source": [
+ "##### 3. Find the algorithm to run with `systemds`"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "L_0KosFhbhun",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "source": [
+ "# Inspect the directory structure of systemds code base\n",
+ "!ls"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "R7C5DVM7YfTb",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "source": [
+ "# List all the scripts (also called top level algorithms!)\n",
+ "!ls scripts/algorithms"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "5PrxwviWJhNd",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "source": [
+ "# Let's choose the univariate statistics script.\n",
+ "# Print the algorithm documentation, which spans\n",
+ "# lines 22 through 35 of the script.\n",
+ "!sed -n 22,35p ./scripts/algorithms/Univar-Stats.dml"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "zv_7wRPFSeuJ",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "source": [
+ "!./bin/systemds ./scripts/algorithms/Univar-Stats.dml -nvargs X=../data/haberman.data TYPES=../data/types.csv STATS=../data/univarOut.mtx CONSOLE_OUTPUT=TRUE"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "IqY_ARNnavrC",
+ "colab_type": "text"
+ },
+ "source": [
+ "##### 3.1 Let us inspect the output data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "k-_eQg9TauPi",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "source": [
+ "# output first 10 lines only.\n",
+ "!sed -n 1,10p ../data/univarOut.mtx"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "o5VCCweiDMjf",
+ "colab_type": "text"
+ },
+ "source": [
+ "### B. Run SystemDS with Apache Spark"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "6gJhL7lc1vf7",
+ "colab_type": "text"
+ },
+ "source": [
+ "#### Playground for DML scripts\n",
+ "\n",
+ "DML - A custom language designed for SystemDS with R-like syntax."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "zzqeSor__U6M",
+ "colab_type": "text"
+ },
+ "source": [
+ "##### A test `dml` script to prototype algorithms\n",
+ "\n",
+ "Modify the code in the cell below and run it to develop data science tasks\n",
+ "in a high-level language."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "t59rTyNbOF5b",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "source": [
+ "%%writefile ../test.dml\n",
+ "\n",
+ "# This code acts as a playground for DML code\n",
+ "X = rand (rows = 20, cols = 10)\n",
+ "y = X %*% rand(rows = ncol(X), cols = 1)\n",
+ "lm(X = X, y = y)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "VDfeuJYE1JfK",
+ "colab_type": "text"
+ },
+ "source": [
+ "Submit the `dml` script to Spark with `spark-submit`.\n",
+ "More about [Spark Submit ↗](https://spark.apache.org/docs/latest/submitting-applications.html)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "YokktyNE1Cig",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "source": [
+ "!$SPARK_HOME/bin/spark-submit \\\n",
+ " ./target/SystemDS.jar -f ../test.dml"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "gCMkudo_-8_8",
+ "colab_type": "text"
+ },
+ "source": [
+ "##### Run a binary classification example with sample data\n",
+ "\n",
+ "Notice that this example is written entirely in plain DML; no other scripts are used."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "OSLq2cZb_SUl",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "source": [
+ "# Example binary classification task with sample data.\n",
+ "# !$SPARK_HOME/bin/spark-submit ./target/SystemDS.jar -f ./scripts/nn/examples/fm-binclass-dummy-data.dml"
+ ],
+ "execution_count": null,
+ "outputs": []
+ }
+ ]
+}
\ No newline at end of file
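
Note on setting environment variables from notebook cells: in IPython and Colab, each `!` line runs in its own child shell, so an `export` on one line is gone by the next line and never reaches the kernel. A minimal sketch of the difference, using `subprocess` as a stand-in for `!` (an assumption made so the snippet runs outside a notebook). The variable name `DEMO_SYSTEMDS_ROOT` and the path are hypothetical and used only for this demonstration.

```python
import os
import subprocess

VAR = "DEMO_SYSTEMDS_ROOT"  # hypothetical name, used only for this demonstration
os.environ.pop(VAR, None)   # start from a clean state


def child_value(name):
    """Return the value of `name` as seen by a fresh child process ('' if unset)."""
    result = subprocess.run(["printenv", name], capture_output=True, text=True)
    return result.stdout.strip()


# 1) Export inside a child shell: the variable dies with that shell.
subprocess.run(f"export {VAR}=/tmp/systemds", shell=True, check=True)
after_export = child_value(VAR)

# 2) Set it on the Python process: every later child process inherits it.
os.environ[VAR] = "/tmp/systemds"
after_environ = child_value(VAR)

print(repr(after_export), repr(after_environ))
```

The first value is empty and the second is the path, which is why variables such as `SYSTEMDS_ROOT` need to be set through `os.environ` (or `%env`) for scripts like `./bin/systemds` to see them in later cells.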