Posted to commits@systemml.apache.org by ac...@apache.org on 2017/08/24 08:12:03 UTC

systemml git commit: [SYSTEMML-1742] Transfer Learning using Caffe VGG-19 model

Repository: systemml
Updated Branches:
  refs/heads/master 7e3c03609 -> 0ee8800b8


[SYSTEMML-1742] Transfer Learning using Caffe VGG-19 model


Project: http://git-wip-us.apache.org/repos/asf/systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/systemml/commit/0ee8800b
Tree: http://git-wip-us.apache.org/repos/asf/systemml/tree/0ee8800b
Diff: http://git-wip-us.apache.org/repos/asf/systemml/diff/0ee8800b

Branch: refs/heads/master
Commit: 0ee8800b8e10b65983c61677b00c2bfb185c1d38
Parents: 7e3c036
Author: Arvind Surve <ac...@yahoo.com>
Authored: Thu Aug 24 01:09:26 2017 -0700
Committer: Arvind Surve <ac...@yahoo.com>
Committed: Thu Aug 24 01:11:01 2017 -0700

----------------------------------------------------------------------
 ...lassify_Using_VGG_19_Transfer_Learning.ipynb | 520 +++++++++++++++++++
 1 file changed, 520 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/systemml/blob/0ee8800b/samples/jupyter-notebooks/Image_Classify_Using_VGG_19_Transfer_Learning.ipynb
----------------------------------------------------------------------
diff --git a/samples/jupyter-notebooks/Image_Classify_Using_VGG_19_Transfer_Learning.ipynb b/samples/jupyter-notebooks/Image_Classify_Using_VGG_19_Transfer_Learning.ipynb
new file mode 100644
index 0000000..048308a
--- /dev/null
+++ b/samples/jupyter-notebooks/Image_Classify_Using_VGG_19_Transfer_Learning.ipynb
@@ -0,0 +1,520 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Image Classification using Caffe VGG-19 model (Transfer Learning)\n",
+    "\n",
+    "This notebook demonstrates importing VGG-19 model from Caffe to SystemML and use that model to do an image classification. VGG-19 model has been trained using ImageNet dataset (1000 classes with ~ 14M images). If an image to be predicted is in one of the class VGG-19 has trained on then accuracy will be higher.\n",
+    "We expect prediction of any image through SystemML using VGG-19 model will be similar to that of image  predicted through Caffe using VGG-19 model directly."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Prerequisite:\n",
+    "1. SystemML Python Package\n",
+    "To run this notebook you need to install systeml 1.0 (Master Branch code as of 08/24/2017 or later) python package.\n",
+    "2. Download Dogs-vs-Cats Kaggle dataset from https://www.kaggle.com/c/dogs-vs-cats/data location to a directory.\n",
+    "   Unzip the train.zip directory to some location and update the variable \"train_dir\" in bottom two cells in which    classifyImagesWTransfLearning() and classifyImages() methods are called to test this change. "
+   ]
+  },
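+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The next cell is a minimal, hedged sketch of the setup steps above. The package source (a PyPI release vs. a locally built artifact) and the extraction path are assumptions; adjust them to your environment and uncomment the lines before running."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "# Sketch of the prerequisite setup; the paths below are placeholders, not part of the original notebook.\n",
+    "# Install the SystemML Python package (or install a locally built package instead).\n",
+    "# !pip install systemml\n",
+    "\n",
+    "# Unzip the Kaggle train.zip archive and point train_dir (used in the last two cells) at the result.\n",
+    "# !unzip -q train.zip -d /path/to/dogs_vs_cats\n",
+    "# train_dir = '/path/to/dogs_vs_cats/train'"
+   ]
+  },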
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "##### SystemML Python Package information"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Name: systemml\r\n",
+      "Version: 1.0.0\r\n",
+      "Summary: Apache SystemML is a distributed and declarative machine learning platform.\r\n",
+      "Home-page: http://systemml.apache.org/\r\n",
+      "Author: Apache SystemML\r\n",
+      "Author-email: dev@systemml.apache.org\r\n",
+      "License: Apache 2.0\r\n",
+      "Location: /home/asurve/src/anaconda2/lib/python2.7/site-packages\r\n",
+      "Requires: Pillow, numpy, scipy, pandas\r\n"
+     ]
+    }
+   ],
+   "source": [
+    "!pip show systemml"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### SystemML Build information\n",
+    "Following code will show SystemML information which is installed in the environment."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "collapsed": false,
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "SystemML Built-Time:2017-08-17 19:20:41 UTC\n",
+      "Archiver-Version: Plexus Archiver\n",
+      "Artifact-Id: systemml\n",
+      "Build-Jdk: 1.8.0_121\n",
+      "Build-Time: 2017-08-17 19:20:41 UTC\n",
+      "Built-By: asurve\n",
+      "Created-By: Apache Maven 3.3.9\n",
+      "Group-Id: org.apache.systemml\n",
+      "Main-Class: org.apache.sysml.api.DMLScript\n",
+      "Manifest-Version: 1.0\n",
+      "Minimum-Recommended-Spark-Version: 2.1.0\n",
+      "Version: 1.0.0-SNAPSHOT\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "from systemml import MLContext\n",
+    "ml = MLContext(sc)\n",
+    "print (\"SystemML Built-Time:\"+ ml.buildTime())\n",
+    "print(ml.info())"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "collapsed": false,
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "# Workaround for Python 2.7.13 to avoid certificate validation issue while downloading any file.\n",
+    "\n",
+    "import ssl\n",
+    "\n",
+    "try:\n",
+    "    _create_unverified_https_context = ssl._create_unverified_context\n",
+    "except AttributeError:\n",
+    "    # Legacy Python that doesn't verify HTTPS certificates by default\n",
+    "    pass\n",
+    "else:\n",
+    "    # Handle target environment that doesn't support HTTPS verification\n",
+    "    ssl._create_default_https_context = _create_unverified_https_context"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "# Create label.txt file\n",
+    "\n",
+    "def createLabelFile(fileName):\n",
+    "    file = open(fileName, 'w')\n",
+    "    file.write('1,\"Cat\" \\n')\n",
+    "    file.write('2,\"Dog\" \\n')\n",
+    "    file.close()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Download model, proto files and convert them to SystemML format.\n",
+    "\n",
+    "1. Download Caffe Model (VGG-19), proto files (deployer, network and solver) and label file.\n",
+    "2. Convert the Caffe model into SystemML input format.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# Download caffemodel and proto files \n",
+    "\n",
+    "\n",
+    "def downloadAndConvertModel(downloadDir='.', trained_vgg_weights='trained_vgg_weights'):\n",
+    "    \n",
+    "    # Step 1: Download the VGG-19 model and other files.\n",
+    "    import errno\n",
+    "    import os\n",
+    "    import urllib\n",
+    "\n",
+    "    # Create directory, if exists don't error out\n",
+    "    try:\n",
+    "        os.makedirs(os.path.join(downloadDir,trained_vgg_weights))\n",
+    "    except OSError as exc:  # Python >2.5\n",
+    "        if exc.errno == errno.EEXIST and os.path.isdir(trained_vgg_weights):\n",
+    "            pass\n",
+    "        else:\n",
+    "            raise\n",
+    "        \n",
+    "    # Download deployer, network, solver proto and label files.\n",
+    "    urllib.urlretrieve('https://raw.githubusercontent.com/apache/systemml/master/scripts/nn/examples/caffe2dml/models/imagenet/vgg19/VGG_ILSVRC_19_layers_deploy.proto', os.path.join(downloadDir,'VGG_ILSVRC_19_layers_deploy.proto'))\n",
+    "    urllib.urlretrieve('https://raw.githubusercontent.com/apache/systemml/master/scripts/nn/examples/caffe2dml/models/imagenet/vgg19/VGG_ILSVRC_19_layers_network.proto',os.path.join(downloadDir,'VGG_ILSVRC_19_layers_network.proto'))\n",
+    "    #TODO: After downloading network file (VGG_ILSVRC_19_layers_network.proto) , change num_output from 1000 to 2\n",
+    "    \n",
+    "    urllib.urlretrieve('https://raw.githubusercontent.com/apache/systemml/master/scripts/nn/examples/caffe2dml/models/imagenet/vgg19/VGG_ILSVRC_19_layers_solver.proto',os.path.join(downloadDir,'VGG_ILSVRC_19_layers_solver.proto'))\n",
+    "    # TODO: set values as descrived below in VGG_ILSVRC_19_layers_solver.proto (Possibly through APIs whenever available)\n",
+    "    #  test_iter: 100\n",
+    "    #  stepsize: 40\n",
+    "    #  max_iter: 200\n",
+    "    \n",
+    "    # Create labels for data\n",
+    "    ### 1,\"Cat\"\n",
+    "    ### 2,\"Dog\"\n",
+    "    createLabelFile(os.path.join(downloadDir, trained_vgg_weights, 'labels.txt'))\n",
+    "\n",
+    "    # TODO: Following line commented as its 500MG file, if u need to download it please uncomment it and run.\n",
+    "    # urllib.urlretrieve('http://www.robots.ox.ac.uk/~vgg/software/very_deep/caffe/VGG_ILSVRC_19_layers.caffemodel', os.path.join(downloadDir,'VGG_ILSVRC_19_layers.caffemodel'))\n",
+    "\n",
+    "    # Step 2: Convert the caffemodel to trained_vgg_weights directory\n",
+    "    import systemml as sml\n",
+    "    sml.convert_caffemodel(sc, os.path.join(downloadDir,'VGG_ILSVRC_19_layers_deploy.proto'), os.path.join(downloadDir,'VGG_ILSVRC_19_layers.caffemodel'), os.path.join(downloadDir,trained_vgg_weights))\n",
+    "    \n",
+    "    return"
+   ]
+  },
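+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The next cell is an optional, hedged sketch of the manual proto edits called out in the TODOs above (num_output in the network file; test_iter, stepsize and max_iter in the solver file). It is a plain regex substitution that assumes the downloaded files use the standard Caffe prototxt key names; review the edited files before training."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "# Optional sketch: script the manual proto edits from the TODOs via regex substitution.\n",
+    "# Assumption: the proto files have been downloaded to the current directory.\n",
+    "import re\n",
+    "\n",
+    "def editProtoFile(path, replacements):\n",
+    "    with open(path) as f:\n",
+    "        txt = f.read()\n",
+    "    for pattern, value in replacements:\n",
+    "        txt = re.sub(pattern, value, txt)\n",
+    "    with open(path, 'w') as f:\n",
+    "        f.write(txt)\n",
+    "\n",
+    "# In VGG-19 only the final fc8 layer uses num_output: 1000, so this touches just that layer.\n",
+    "# editProtoFile('VGG_ILSVRC_19_layers_network.proto', [(r'num_output:\\s*1000', 'num_output: 2')])\n",
+    "# editProtoFile('VGG_ILSVRC_19_layers_solver.proto',\n",
+    "#               [(r'test_iter:\\s*\\d+', 'test_iter: 100'),\n",
+    "#                (r'stepsize:\\s*\\d+', 'stepsize: 40'),\n",
+    "#                (r'max_iter:\\s*\\d+', 'max_iter: 200')])"
+   ]
+  },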
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "##### PrintTopK\n",
+    "This function will print top K probabilities and indices from the result."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# Print top K indices and probability\n",
+    "\n",
+    "def printTopK(prob, label, k):\n",
+    "    print(label, 'Top ', k, ' Index : ', np.argsort(-prob)[0, :k])\n",
+    "    print(label, 'Top ', k, ' Probability : ', prob[0,np.argsort(-prob)[0, :k]])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Classify images\n",
+    "\n",
+    "This function classify images from images specified through urls.\n",
+    "\n",
+    "###### Input Parameters: \n",
+    "    urls: List of urls\n",
+    "    printTokKData (default False): Whether to print top K indices and probabilities\n",
+    "    topK: Top K elements to be displayed. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import urllib\n",
+    "from systemml.mllearn import Caffe2DML\n",
+    "import systemml as sml\n",
+    "\n",
+    "\n",
+    "def classifyImages(urls,img_shape=(3, 224, 224), printTokKData=False, topK=5, downloadDir='.', trained_vgg_weights='trained_vgg_weights'):\n",
+    "\n",
+    "    size = (img_shape[1], img_shape[2])\n",
+    "    \n",
+    "    vgg = Caffe2DML(sqlCtx, solver=os.path.join(downloadDir,'VGG_ILSVRC_19_layers_solver.proto'), input_shape=img_shape)\n",
+    "    vgg.load(trained_vgg_weights)\n",
+    "\n",
+    "    for url in urls:\n",
+    "        outFile = 'inputTest.jpg'\n",
+    "        urllib.urlretrieve(url, outFile)\n",
+    "    \n",
+    "        from IPython.display import Image, display\n",
+    "        display(Image(filename=outFile))\n",
+    "    \n",
+    "        print (\"Prediction of above image to ImageNet Class using\");\n",
+    "\n",
+    "        ## Do image classification through SystemML processing\n",
+    "        from PIL import Image\n",
+    "        input_image = sml.convertImageToNumPyArr(Image.open(outFile), img_shape=img_shape\n",
+    "                                                , color_mode='BGR', mean=sml.getDatasetMean('VGG_ILSVRC_19_2014'))\n",
+    "        print (\"Image preprocessed through SystemML :: \",  vgg.predict(input_image)[0])\n",
+    "        if(printTopKData == True):\n",
+    "            sysml_proba = vgg.predict_proba(input_image)\n",
+    "            printTopK(sysml_proba, 'SystemML BGR', topK)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "from pyspark.ml.linalg import Vectors\n",
+    "import os\n",
+    "import systemml as sml\n",
+    "\n",
+    "\n",
+    "def getLabelFeatures(filename, train_dir, img_shape):\n",
+    "    from PIL import Image\n",
+    "\n",
+    "    vec = Vectors.dense(sml.convertImageToNumPyArr(Image.open(os.path.join(train_dir, filename)), img_shape=img_shape)[0,:])\n",
+    "    if filename.lower().startswith('cat'):\n",
+    "        return (1, vec)\n",
+    "    elif filename.lower().startswith('dog'):\n",
+    "        return (2, vec)\n",
+    "    else:\n",
+    "        raise ValueError('Expected the filename to start with either cat or dog')\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "from pyspark.sql.functions import rand\n",
+    "import os\n",
+    "\n",
+    "def createTrainingDF(train_dir, train_data_file, img_shape):\n",
+    "    list_jpeg_files = os.listdir(train_dir)\n",
+    "    # 10 files per partition\n",
+    "    train_df = sc.parallelize(list_jpeg_files, int(len(list_jpeg_files)/10)).map(lambda filename : getLabelFeatures(filename, train_dir, img_shape)).toDF(['label', 'features']).orderBy(rand())\n",
+    "    # Optional: but helps seperates conversion-related from training\n",
+    "    # train_df.write.parquet(train_data_file)  # 'kaggle-cats-dogs.parquet'\n",
+    "    return train_df"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "def readTrainingDF(train_dir, train_data_file):\n",
+    "    train_df = sqlContext.read.parquet(train_data_file)\n",
+    "    return train_df"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "# downloadAndConvertModel(downloadDir, trained_vgg_weights)\n",
+    "# TODO: Take \"TODO\" actions mentioned in the downloadAndConvertModel() function after calling downloadAndConvertModel() function."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "def retrainModel(img_shape, downloadDir, trained_vgg_weights, train_dir, train_data_file, vgg_new_model):\n",
+    "\n",
+    "    # Let downloadAndConvertModel() functon be commented out, as it needs to be called separately (which is done in cell above) and manual action to be taken after calling it.\n",
+    "    # downloadAndConvertModel(downloadDir, trained_vgg_weights)\n",
+    "    # TODO: Take \"TODO\" actions mentioned in the downloadAndConvertModel() function after calling that function.\n",
+    "    \n",
+    "    train_df = createTrainingDF(train_dir, train_data_file, img_shape)\n",
+    "    ## Write from input files OR read if its already written/converted\n",
+    "    # train_df = readTrainingDF(train_dir, train_data_file)\n",
+    "        \n",
+    "    # Load the model\n",
+    "    vgg = Caffe2DML(sqlCtx, solver=os.path.join(downloadDir,'VGG_ILSVRC_19_layers_solver.proto'), input_shape=img_shape)\n",
+    "    vgg.load(weights=os.path.join(downloadDir,trained_vgg_weights), ignore_weights=['fc8'])\n",
+    "    vgg.set(debug=True).setExplain(True)\n",
+    "\n",
+    "    # Train the model using new data\n",
+    "    vgg.fit(train_df)\n",
+    "    \n",
+    "    # Save the trained model\n",
+    "    vgg.save(vgg_new_model)\n",
+    "    \n",
+    "    return vgg"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import urllib\n",
+    "from systemml.mllearn import Caffe2DML\n",
+    "import systemml as sml\n",
+    "\n",
+    "\n",
+    "def classifyImagesWTransfLearning(urls, model, img_shape=(3, 224, 224), printTokKData=False, topK=5):\n",
+    "\n",
+    "    size = (img_shape[1], img_shape[2])\n",
+    "    # vgg.load(trained_vgg_weights)\n",
+    "\n",
+    "    for url in urls:\n",
+    "        outFile = 'inputTest.jpg'\n",
+    "        urllib.urlretrieve(url, outFile)\n",
+    "    \n",
+    "        from IPython.display import Image, display\n",
+    "        display(Image(filename=outFile))\n",
+    "    \n",
+    "        print (\"Prediction of above image to ImageNet Class using\");\n",
+    "\n",
+    "        ## Do image classification through SystemML processing\n",
+    "        from PIL import Image\n",
+    "        input_image = sml.convertImageToNumPyArr(Image.open(outFile), img_shape=img_shape\n",
+    "                                                , color_mode='BGR', mean=sml.getDatasetMean('VGG_ILSVRC_19_2014'))\n",
+    "\n",
+    "        print (\"Image preprocessed through SystemML :: \",  model.predict(input_image)[0])\n",
+    "        if(printTopKData == True):\n",
+    "            sysml_proba = model.predict_proba(input_image)\n",
+    "            printTopK(sysml_proba, 'SystemML BGR', topK)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Sample code to retrain the model and use it to classify image through two different way\n",
+    "\n",
+    "There are couple of parameters to set based on what you are looking for.\n",
+    "1. printTopKData (default False): If this parameter gets set to True, then top K results (probabilities and indices) will be displayed. \n",
+    "2. topK (default 5): How many entities (K) to be displayed.\n",
+    "3. Directories, data file name, model name and directory where data has donwloaded."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false,
+    "scrolled": false
+   },
+   "outputs": [],
+   "source": [
+    "# ImageNet specific parameters\n",
+    "img_shape = (3, 224, 224)\n",
+    "\n",
+    "# Setting other than current directory causes \"network file not found\" issue, as network file\n",
+    "# location is defined in solver file which does not have a path, so it searches in current dir.\n",
+    "downloadDir = '.' # /home/asurve/caffe_models' \n",
+    "trained_vgg_weights = 'trained_vgg_weights'\n",
+    "\n",
+    "train_dir = '/home/asurve/data/keggle/dogs_vs_cats_2/train'\n",
+    "train_data_file = 'kaggle-cats-dogs.parquet'\n",
+    "    \n",
+    "vgg_new_model = 'kaggle-cats-dogs-model_2'\n",
+    "    \n",
+    "printTopKData=True\n",
+    "topK=5\n",
+    "\n",
+    "urls = ['http://cdn3-www.dogtime.com/assets/uploads/gallery/goldador-dog-breed-pictures/puppy-1.jpg','https://lh3.googleusercontent.com/-YdeAa1Ff4Ac/VkUnQ4vuZGI/AAAAAAAAAEg/nBiUn4pp6aE/w800-h800/images-6.jpeg','https://upload.wikimedia.org/wikipedia/commons/thumb/5/58/MountainLion.jpg/312px-MountainLion.jpg']\n",
+    "\n",
+    "vgg = retrainModel(img_shape, downloadDir, trained_vgg_weights, train_dir, train_data_file, vgg_new_model)\n",
+    "classifyImagesWTransfLearning(urls, vgg, img_shape, printTopKData, topK)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false,
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "img_shape = (3, 224, 224)\n",
+    "\n",
+    "printTopKData=True\n",
+    "topK=5\n",
+    "\n",
+    "# Setting other than current directory causes \"network file not found\" issue, as network file\n",
+    "# location is defined in solver file which does not have a path, so it searches in current dir.\n",
+    "downloadDir = '.' # /home/asurve/caffe_models' \n",
+    "trained_vgg_weights = 'kaggle-cats-dogs-model_2'\n",
+    "\n",
+    "urls = ['http://cdn3-www.dogtime.com/assets/uploads/gallery/goldador-dog-breed-pictures/puppy-1.jpg','https://lh3.googleusercontent.com/-YdeAa1Ff4Ac/VkUnQ4vuZGI/AAAAAAAAAEg/nBiUn4pp6aE/w800-h800/images-6.jpeg','https://upload.wikimedia.org/wikipedia/commons/thumb/5/58/MountainLion.jpg/312px-MountainLion.jpg']\n",
+    "\n",
+    "classifyImages(urls,img_shape, printTopKData, topK, downloadDir, trained_vgg_weights)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 2",
+   "language": "python",
+   "name": "python2"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 2
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython2",
+   "version": "2.7.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}