Posted to dev@systemds.apache.org by GitBox <gi...@apache.org> on 2020/09/21 18:25:59 UTC

[GitHub] [systemds] Baunsgaard opened a new pull request #1061: [SYSTEMDS-2669] Python Mnist LogReg Tutorial

Baunsgaard opened a new pull request #1061:
URL: https://github.com/apache/systemds/pull/1061


   Python tutorial for the MNIST dataset, using LogReg (logistic regression) as the algorithm.
   
   This PR also adds:
   
   - one-hot encoding in Python (a rough sketch of the idea follows below)
   - an MNIST dataset parser for downloading and parsing MNIST easily
   - moving the Python matrix tests to their own folder.
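
   For readers skimming this description, a rough numpy sketch of the one-hot idea (the PR's actual implementation lives in the SystemDS Python API and may differ):

       import numpy as np

       # hypothetical MNIST-style labels in the range 0..9
       labels = np.array([3, 0, 7])
       num_classes = 10

       # one row per label, with a single 1 in the column of that label
       one_hot = np.zeros((labels.shape[0], num_classes))
       one_hot[np.arange(labels.shape[0]), labels] = 1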


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] Baunsgaard closed pull request #1061: [SYSTEMDS-2669] Python Mnist LogReg Tutorial

Posted by GitBox <gi...@apache.org>.
Baunsgaard closed pull request #1061:
URL: https://github.com/apache/systemds/pull/1061


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] Baunsgaard commented on pull request #1061: [SYSTEMDS-2669] Python Mnist LogReg Tutorial

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on pull request #1061:
URL: https://github.com/apache/systemds/pull/1061#issuecomment-697972277


   @sebwrede 
   Now it should work all the time with any tolerance, if you want to try again!
   
   Otherwise, if it LGT(you), then I will merge.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] Baunsgaard commented on pull request #1061: [SYSTEMDS-2669] Python Mnist LogReg Tutorial

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on pull request #1061:
URL: https://github.com/apache/systemds/pull/1061#issuecomment-696803217


   This should be ready for review now.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] sebwrede commented on a change in pull request #1061: [SYSTEMDS-2669] Python Mnist LogReg Tutorial

Posted by GitBox <gi...@apache.org>.
sebwrede commented on a change in pull request #1061:
URL: https://github.com/apache/systemds/pull/1061#discussion_r493422672



##########
File path: src/main/python/docs/source/guide/algorithms.rst
##########
@@ -26,18 +26,177 @@ Prerequisite:
 
 - :doc:`/getting_started/install`
 
+This example goes through an algorithm from the list of builtin algorithms that can be applied to a dataset.
+For simplicity the dataset used for this is `MNIST <http://yann.lecun.com/exdb/mnist/>`_,
+since it is commonly known and explored.
+
+If one wants to skip the explanation then the full script is available at the bottom of this page.
 
 Step 1: Get Dataset
 -------------------
 
-TODO
+SystemDS provides builtin for downloading and setup of the MNIST dataset.
+To setup this simply use::
 
-Step 2: Train model
--------------------
+    from systemds.examples.tutorials.mnist import DataManager
+    d = DataManager()
+    X = d.get_train_data()
+    Y = d.get_train_labels()
+
+Here the DataManager contains the code for downloading and setting up numpy arrays containing the data.
+
+Step 2: Reshape & Format
+------------------------
+
+Usually data does not come in formats that perfectly fits the algorithms, to make this tutorial more
+realistic some data preprocessing is required to change the input to fit.
+
+First the Training data, X, has multiple dimensions resulting in a shape (60000, 28, 28).
+The dimensions correspond to first the number of images 60000, then the number of row pixels, 28,
+and finally the column pixels, 28.
+
+To use this data for Logistic Regression we have to reduce the dimensions.
+The input X is the training data. 
+It require the data to have two dimensions, the first resemble the
+number of inputs, and the other the number of features.
+
+Therefore to make the data fit the algorithm we reshape the X dataset, like so::
+
+    X = X.reshape((60000, 28*28))
+
+This takes each row of pixels and append to each other making a single feature vector per image.
+
+The Y dataset also does not perfectly fit the Logistic Regression algorithm, this is because the labels
+for this dataset is values ranging from 0, to 9, each label correspond to the integer shown in the image.
+unfortunately the algorithm require the labels to be distinct integers from 1 and upwards.
+
+Therefore we add 1 to each label such that the labels go from 1 to 10, like this::
+
+    Y = Y + 1
+
+With these steps we are now ready to train a simple logistic model.
+
+Step 3: Training
+----------------
+
+To start with, we setup a SystemDS context::
+
+    from systemds.context import SystemDSContext
+    sds = SystemDSContext()

Review comment:
       File "/systemds/src/main/python/systemds/context/systemds_context.py", line 96, in __init__
         raise Exception("Exception in startup of GatewayServer: " + stderr)
   Exception: Exception in startup of GatewayServer: Error: Could not find or load main class org.apache.sysds.api.PythonDMLScript




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] Baunsgaard commented on pull request #1061: [SYSTEMDS-2669] Python Mnist LogReg Tutorial

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on pull request #1061:
URL: https://github.com/apache/systemds/pull/1061#issuecomment-696290794


   Also added shape as a variable for the OperatorNode. This is not a finished concept yet, since files read from disk do not carry their shapes, and multi-return does not return the shapes correctly either.
   
   Future work would be to improve the design of the multi-returns such that individual parts of them can be used in subsequent operations.
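
   As a toy illustration of the shape idea (hypothetical class and method names, not the actual OperatorNode API):

       class ToyNode:
           """Hypothetical lazy operator node that tracks an optional shape."""

           def __init__(self, shape=None):
               # None models the unknown case, e.g. data read from disk
               self.shape = shape

           def reshape(self, new_shape):
               # after an explicit reshape the shape is known again
               return ToyNode(new_shape)

           def multi_return(self):
               # mirrors the current limitation: shapes of the individual
               # outputs are not propagated yet
               return [ToyNode(None), ToyNode(None)]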


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] Baunsgaard commented on a change in pull request #1061: [SYSTEMDS-2669] Python Mnist LogReg Tutorial

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on a change in pull request #1061:
URL: https://github.com/apache/systemds/pull/1061#discussion_r493544389



##########
File path: src/main/python/docs/source/guide/algorithms.rst
##########
@@ -26,18 +26,177 @@ Prerequisite:
 
 - :doc:`/getting_started/install`
 
+This example goes through an algorithm from the list of builtin algorithms that can be applied to a dataset.
+For simplicity the dataset used for this is `MNIST <http://yann.lecun.com/exdb/mnist/>`_,
+since it is commonly known and explored.
+
+If one wants to skip the explanation then the full script is available at the bottom of this page.
 
 Step 1: Get Dataset
 -------------------
 
-TODO
+SystemDS provides builtin for downloading and setup of the MNIST dataset.
+To setup this simply use::
 
-Step 2: Train model
--------------------
+    from systemds.examples.tutorials.mnist import DataManager
+    d = DataManager()
+    X = d.get_train_data()
+    Y = d.get_train_labels()
+
+Here the DataManager contains the code for downloading and setting up numpy arrays containing the data.
+
+Step 2: Reshape & Format
+------------------------
+
+Usually data does not come in formats that perfectly fits the algorithms, to make this tutorial more
+realistic some data preprocessing is required to change the input to fit.
+
+First the Training data, X, has multiple dimensions resulting in a shape (60000, 28, 28).
+The dimensions correspond to first the number of images 60000, then the number of row pixels, 28,
+and finally the column pixels, 28.
+
+To use this data for Logistic Regression we have to reduce the dimensions.
+The input X is the training data. 
+It require the data to have two dimensions, the first resemble the
+number of inputs, and the other the number of features.
+
+Therefore to make the data fit the algorithm we reshape the X dataset, like so::
+
+    X = X.reshape((60000, 28*28))
+
+This takes each row of pixels and append to each other making a single feature vector per image.
+
+The Y dataset also does not perfectly fit the Logistic Regression algorithm, this is because the labels
+for this dataset is values ranging from 0, to 9, each label correspond to the integer shown in the image.
+unfortunately the algorithm require the labels to be distinct integers from 1 and upwards.
+
+Therefore we add 1 to each label such that the labels go from 1 to 10, like this::
+
+    Y = Y + 1
+
+With these steps we are now ready to train a simple logistic model.
+
+Step 3: Training
+----------------
+
+To start with, we setup a SystemDS context::
+
+    from systemds.context import SystemDSContext
+    sds = SystemDSContext()

Review comment:
       Should be resolved now.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] sebwrede commented on a change in pull request #1061: [SYSTEMDS-2669] Python Mnist LogReg Tutorial

Posted by GitBox <gi...@apache.org>.
sebwrede commented on a change in pull request #1061:
URL: https://github.com/apache/systemds/pull/1061#discussion_r493476559



##########
File path: src/main/python/docs/source/guide/algorithms.rst
##########
@@ -26,18 +26,177 @@ Prerequisite:
 
 - :doc:`/getting_started/install`
 
+This example goes through an algorithm from the list of builtin algorithms that can be applied to a dataset.
+For simplicity the dataset used for this is `MNIST <http://yann.lecun.com/exdb/mnist/>`_,
+since it is commonly known and explored.
+
+If one wants to skip the explanation then the full script is available at the bottom of this page.
 
 Step 1: Get Dataset
 -------------------
 
-TODO
+SystemDS provides builtin for downloading and setup of the MNIST dataset.
+To setup this simply use::
 
-Step 2: Train model
--------------------
+    from systemds.examples.tutorials.mnist import DataManager
+    d = DataManager()
+    X = d.get_train_data()
+    Y = d.get_train_labels()
+
+Here the DataManager contains the code for downloading and setting up numpy arrays containing the data.
+
+Step 2: Reshape & Format
+------------------------
+
+Usually data does not come in formats that perfectly fits the algorithms, to make this tutorial more
+realistic some data preprocessing is required to change the input to fit.
+
+First the Training data, X, has multiple dimensions resulting in a shape (60000, 28, 28).
+The dimensions correspond to first the number of images 60000, then the number of row pixels, 28,
+and finally the column pixels, 28.
+
+To use this data for Logistic Regression we have to reduce the dimensions.
+The input X is the training data. 
+It require the data to have two dimensions, the first resemble the
+number of inputs, and the other the number of features.
+
+Therefore to make the data fit the algorithm we reshape the X dataset, like so::
+
+    X = X.reshape((60000, 28*28))
+
+This takes each row of pixels and append to each other making a single feature vector per image.
+
+The Y dataset also does not perfectly fit the Logistic Regression algorithm, this is because the labels
+for this dataset is values ranging from 0, to 9, each label correspond to the integer shown in the image.
+unfortunately the algorithm require the labels to be distinct integers from 1 and upwards.
+
+Therefore we add 1 to each label such that the labels go from 1 to 10, like this::
+
+    Y = Y + 1
+
+With these steps we are now ready to train a simple logistic model.
+
+Step 3: Training
+----------------
+
+To start with, we setup a SystemDS context::
+
+    from systemds.context import SystemDSContext
+    sds = SystemDSContext()

Review comment:
       It is required to run `export SYSTEMDS_ROOT=$(pwd)` in the SystemDS root directory before doing this step.
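
   For anyone following along, roughly the same setup can be done from inside Python, assuming the context reads SYSTEMDS_ROOT from the process environment when it launches the JVM (the path below is a placeholder):

       import os
       from systemds.context import SystemDSContext

       # placeholder path; point this at the root of your SystemDS checkout
       os.environ["SYSTEMDS_ROOT"] = "/path/to/systemds"

       # assumption: with the variable set, the Java class
       # org.apache.sysds.api.PythonDMLScript can be found and the gateway starts
       sds = SystemDSContext()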




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] sebwrede commented on a change in pull request #1061: [SYSTEMDS-2669] Python Mnist LogReg Tutorial

Posted by GitBox <gi...@apache.org>.
sebwrede commented on a change in pull request #1061:
URL: https://github.com/apache/systemds/pull/1061#discussion_r493347509



##########
File path: src/main/python/docs/source/guide/algorithms.rst
##########
@@ -26,18 +26,176 @@ Prerequisite:
 
 - :doc:`/getting_started/install`
 
+This example goes through an algorithm from the list of builtin algorithms that can be applied to a dataset.
+For simplicity the dataset used for this is `MNist <http://yann.lecun.com/exdb/mnist/>`_,
+since it is commonly known and explored.
+
+If one want to skip the explanation then the full script is available in to bottom of this page.

Review comment:
       > If one want* to
   
   *wants 
   
   > available (in to)* bottom
   *at the

##########
File path: src/main/python/docs/source/guide/algorithms.rst
##########
@@ -26,18 +26,176 @@ Prerequisite:
 
 - :doc:`/getting_started/install`
 
+This example goes through an algorithm from the list of builtin algorithms that can be applied to a dataset.
+For simplicity the dataset used for this is `MNist <http://yann.lecun.com/exdb/mnist/>`_,
+since it is commonly known and explored.
+
+If one want to skip the explanation then the full script is available in to bottom of this page.
 
 Step 1: Get Dataset
 -------------------
 
-TODO
+Systemds provide builtin for downloading and setup of the mnist dataset.

Review comment:
       > Systemds provide*
   *provides

##########
File path: src/main/python/docs/source/guide/algorithms.rst
##########
@@ -26,18 +26,176 @@ Prerequisite:
 
 - :doc:`/getting_started/install`
 
+This example goes through an algorithm from the list of builtin algorithms that can be applied to a dataset.
+For simplicity the dataset used for this is `MNist <http://yann.lecun.com/exdb/mnist/>`_,
+since it is commonly known and explored.
+
+If one want to skip the explanation then the full script is available in to bottom of this page.
 
 Step 1: Get Dataset
 -------------------
 
-TODO
+Systemds provide builtin for downloading and setup of the mnist dataset.
+To setup this simply use::
 
-Step 2: Train model
--------------------
+    from systemds.examples.tutorials.mnist import DataManager
+    d = DataManager()
+    X = d.get_train_data()
+    Y = d.get_train_labels()
+
+Here the DataManager contains the code for downloading and setting up numpy arrays containing the data.
+
+Step 2: Reshape & Format
+------------------------
+
+Usually data does not come in formats that perfectly fits the algorithms, to make this tutorial more
+realistic some data preprocessing is required to change the input to fit.
+
+First the Training data, X, has multiple dimensions resulting in a shape (60000, 28, 28).
+these dimensions corresponds to first the number of images 60000, then the number of row pixels, 28,

Review comment:
       *correspond

##########
File path: src/main/python/docs/source/guide/algorithms.rst
##########
@@ -26,18 +26,176 @@ Prerequisite:
 
 - :doc:`/getting_started/install`
 
+This example goes through an algorithm from the list of builtin algorithms that can be applied to a dataset.
+For simplicity the dataset used for this is `MNist <http://yann.lecun.com/exdb/mnist/>`_,
+since it is commonly known and explored.
+
+If one want to skip the explanation then the full script is available in to bottom of this page.
 
 Step 1: Get Dataset
 -------------------
 
-TODO
+Systemds provide builtin for downloading and setup of the mnist dataset.
+To setup this simply use::
 
-Step 2: Train model
--------------------
+    from systemds.examples.tutorials.mnist import DataManager
+    d = DataManager()
+    X = d.get_train_data()

Review comment:
       I get: 
   
   >>> X = d.get_train_data()
   
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/.local/lib/python3.8/site-packages/systemds/examples/tutorials/mnist.py", line 56, in get_train_data
       self._get_data(self._train_data_url, self._train_data_loc)
     File "/.local/lib/python3.8/site-packages/systemds/examples/tutorials/mnist.py", line 102, in _get_data
       os.mkdir(folder)
   FileNotFoundError: [Errno 2] No such file or directory: 'systemds/examples/tutorials/mnist'

##########
File path: src/main/python/docs/source/guide/algorithms.rst
##########
@@ -26,18 +26,176 @@ Prerequisite:
 
 - :doc:`/getting_started/install`
 
+This example goes through an algorithm from the list of builtin algorithms that can be applied to a dataset.
+For simplicity the dataset used for this is `MNist <http://yann.lecun.com/exdb/mnist/>`_,
+since it is commonly known and explored.
+
+If one want to skip the explanation then the full script is available in to bottom of this page.
 
 Step 1: Get Dataset
 -------------------
 
-TODO
+Systemds provide builtin for downloading and setup of the mnist dataset.
+To setup this simply use::
 
-Step 2: Train model
--------------------
+    from systemds.examples.tutorials.mnist import DataManager
+    d = DataManager()
+    X = d.get_train_data()
+    Y = d.get_train_labels()
+
+Here the DataManager contains the code for downloading and setting up numpy arrays containing the data.
+
+Step 2: Reshape & Format
+------------------------
+
+Usually data does not come in formats that perfectly fits the algorithms, to make this tutorial more
+realistic some data preprocessing is required to change the input to fit.
+
+First the Training data, X, has multiple dimensions resulting in a shape (60000, 28, 28).
+these dimensions corresponds to first the number of images 60000, then the number of row pixels, 28,

Review comment:
       *These

##########
File path: src/main/python/docs/source/guide/algorithms.rst
##########
@@ -26,18 +26,176 @@ Prerequisite:
 
 - :doc:`/getting_started/install`
 
+This example goes through an algorithm from the list of builtin algorithms that can be applied to a dataset.
+For simplicity the dataset used for this is `MNist <http://yann.lecun.com/exdb/mnist/>`_,
+since it is commonly known and explored.
+
+If one want to skip the explanation then the full script is available in to bottom of this page.
 
 Step 1: Get Dataset
 -------------------
 
-TODO
+Systemds provide builtin for downloading and setup of the mnist dataset.
+To setup this simply use::
 
-Step 2: Train model
--------------------
+    from systemds.examples.tutorials.mnist import DataManager
+    d = DataManager()
+    X = d.get_train_data()
+    Y = d.get_train_labels()
+
+Here the DataManager contains the code for downloading and setting up numpy arrays containing the data.
+
+Step 2: Reshape & Format
+------------------------
+
+Usually data does not come in formats that perfectly fits the algorithms, to make this tutorial more
+realistic some data preprocessing is required to change the input to fit.
+
+First the Training data, X, has multiple dimensions resulting in a shape (60000, 28, 28).
+these dimensions corresponds to first the number of images 60000, then the number of row pixels, 28,
+and finally the column pixels, 28.
+
+To use this data for Logistic Regression we have to reduce the dimensions.
+The input, X, of this algorithm require the data, to have two dimensions, and the first resemble the

Review comment:
       Check commas




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] Baunsgaard commented on a change in pull request #1061: [SYSTEMDS-2669] Python Mnist LogReg Tutorial

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on a change in pull request #1061:
URL: https://github.com/apache/systemds/pull/1061#discussion_r493416856



##########
File path: src/main/python/docs/source/guide/algorithms.rst
##########
@@ -26,18 +26,176 @@ Prerequisite:
 
 - :doc:`/getting_started/install`
 
+This example goes through an algorithm from the list of builtin algorithms that can be applied to a dataset.
+For simplicity the dataset used for this is `MNist <http://yann.lecun.com/exdb/mnist/>`_,
+since it is commonly known and explored.
+
+If one want to skip the explanation then the full script is available in to bottom of this page.
 
 Step 1: Get Dataset
 -------------------
 
-TODO
+Systemds provide builtin for downloading and setup of the mnist dataset.
+To setup this simply use::
 
-Step 2: Train model
--------------------
+    from systemds.examples.tutorials.mnist import DataManager
+    d = DataManager()
+    X = d.get_train_data()

Review comment:
       Should be fixed. The issue is that the mkdir method does not create nested folders; I replaced it with mkdirs.
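
   For reference, the difference in plain Python is roughly the following (a sketch of the idea, not the actual patch; the nested-directory helper in the os module is os.makedirs):

       import os

       folder = "systemds/examples/tutorials/mnist"

       # os.mkdir(folder) raises FileNotFoundError when the parent directories
       # do not exist yet; os.makedirs creates the whole path in one call
       os.makedirs(folder, exist_ok=True)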




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] Baunsgaard commented on pull request #1061: [SYSTEMDS-2669] Python Mnist LogReg Tutorial

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on pull request #1061:
URL: https://github.com/apache/systemds/pull/1061#issuecomment-697257377


   > This is only the first part of the review. I cannot import SystemDS correctly in Python when I follow the tutorial. Do you know why, @Baunsgaard? See my comment with the error message.
   > Another general comment: write "SystemDS" and not "systemds" unless it is a command where it is important that the characters are not capitalized. You could even write "Apache SystemDS" smile
   
   You need to install the SystemDS Python package if you want to run it.
   Therefore I refer to the install guide, which has to be followed first.
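
   A quick sanity check after installing could look like this (a minimal sketch; it only uses the imports and operations already shown in the tutorial, and the authoritative steps are in the install guide):

       import numpy as np
       from systemds.context import SystemDSContext
       from systemds.matrix import Matrix

       sds = SystemDSContext()
       # trivial round trip to confirm the Java gateway starts and computes;
       # cleanup of the context is omitted in this sketch
       m = Matrix(sds, np.ones((2, 2))) + 1
       print(m.compute())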
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] sebwrede commented on a change in pull request #1061: [SYSTEMDS-2669] Python Mnist LogReg Tutorial

Posted by GitBox <gi...@apache.org>.
sebwrede commented on a change in pull request #1061:
URL: https://github.com/apache/systemds/pull/1061#discussion_r494152273



##########
File path: src/main/python/docs/source/guide/algorithms.rst
##########
@@ -26,18 +26,173 @@ Prerequisite:
 
 - :doc:`/getting_started/install`
 
+This example goes through an algorithm from the list of builtin algorithms that can be applied to a dataset.
+For simplicity the dataset used for this is `MNIST <http://yann.lecun.com/exdb/mnist/>`_,
+since it is commonly known and explored.
+
+If one wants to skip the explanation then the full script is available at the bottom of this page.
 
 Step 1: Get Dataset
 -------------------
 
-TODO
+SystemDS provides builtin for downloading and setup of the MNIST dataset.
+To setup this simply use::
 
-Step 2: Train model
--------------------
+    from systemds.examples.tutorials.mnist import DataManager
+    d = DataManager()
+    X = d.get_train_data()
+    Y = d.get_train_labels()
+
+Here the DataManager contains the code for downloading and setting up numpy arrays containing the data.
+
+Step 2: Reshape & Format
+------------------------
+
+Usually data does not come in formats that perfectly fits the algorithms, to make this tutorial more
+realistic some data preprocessing is required to change the input to fit.
+
+First the Training data, X, has multiple dimensions resulting in a shape (60000, 28, 28).
+The dimensions correspond to first the number of images 60000, then the number of row pixels, 28,
+and finally the column pixels, 28.
+
+To use this data for Logistic Regression we have to reduce the dimensions.
+The input X is the training data. 
+It require the data to have two dimensions, the first resemble the
+number of inputs, and the other the number of features.
+
+Therefore to make the data fit the algorithm we reshape the X dataset, like so::
+
+    X = X.reshape((60000, 28*28))
+
+This takes each row of pixels and append to each other making a single feature vector per image.
+
+The Y dataset also does not perfectly fit the Logistic Regression algorithm, this is because the labels
+for this dataset is values ranging from 0, to 9, each label correspond to the integer shown in the image.
+unfortunately the algorithm require the labels to be distinct integers from 1 and upwards.
+
+Therefore we add 1 to each label such that the labels go from 1 to 10, like this::
+
+    Y = Y + 1
+
+With these steps we are now ready to train a simple logistic model.
+
+Step 3: Training
+----------------
+
+To start with, we setup a SystemDS context::
+
+    from systemds.context import SystemDSContext
+    sds = SystemDSContext()
+
+Then setup the data::
+
+    from systemds.matrix import Matrix
+    X_ds = Matrix(sds, X)
+    Y_ds = Matrix(sds, Y)
+
+to reduce the training time and verify everything works, it is usually good to reduce the amount of data,
+to train on a smaller sample to start with::
+
+    sample_size = 1000
+    X_ds = Matrix(sds, X[:sample_size])
+    Y_ds = Matrix(sds, Y[:sample_size])
+
+And now everything is ready for our algorithm::
+
+    from systemds.operator.algorithm import multiLogReg
+
+    bias = multiLogReg(X_ds, Y_ds)
+
+Note that nothing has been calculated yet, in SystemDS, since it only happens when you call compute::
+
+    bias_r = bias.compute()
 
-TODO
+bias is a matrix, that if matrix multiplied with an instance returns a value distribution where, the highest value is the predicted type.
+This is the matrix that could be saved and used for predicting labels later.
 
 Step 3: Validate
 ----------------
 
-TODO
+To see what accuracy the model achieves, we have to load in the test dataset as well.
+
+this can also be extracted from our builtin MNIST loader, to keep the tutorial short the operations are combined::
+
+    Xt = Matrix(sds, d.get_test_data().reshape((10000, 28*28)))
+    Yt = Matrix(sds, d.get_test_labels()) + 1
+
+The above loads the test data, and reshapes the X data the same way the training data was reshaped.
+
+Finally we verify the accuracy by calling::
+
+    from systemds.operator.algorithm import multiLogRegPredict
+    [m, y_pred, acc] = multiLogRegPredict(Xt, bias, Yt).compute()
+    print(acc)
+
+There are three outputs from the multiLogRegPredict call.
+
+m, is the mean probability of correctly classifying each label.
+y_pred, is the predictions made using the model, bias, trained.
+acc, is the accuracy achieved by the model.
+
+If the subset of the training data is used then you could expect an accuracy of 85% in this example
+using 1000 pictures of the training data.
+
+Step 4: Tuning
+--------------
+
+Now that we have a working baseline we can start tuning parameters.
+
+But first it is valuable to know how much of a difference in performance there is on the training data, vs the test data.
+This gives an indication of if we have exhausted the learning potential of the training data.
+
+To see how our accuracy is on the training data we use the Predict function again, but with our training data::
+
+    [m, y_pred, acc] = multiLogRegPredict(X_ds, bias, Y_ds).compute()
+    print(acc)
+
+In this specific case we achieve 100% accuracy on the training data, indicating that we have fit the training data,
+and have nothing more to learn from the data as it is now.
+
+To improve further we have to increase the training data, here for example we increase it
+from our sample of 1k to the full training dataset of 60k, in this example the maxi is set to reduce the number of iterations the algorithm takes,
+to again reduce training time::
+
+    X_ds = Matrix(sds, X)
+    Y_ds = Matrix(sds, Y)
+
+    bias = multiLogReg(X_ds, Y_ds, maxi=30)
+
+    [_, _, train_acc] = multiLogRegPredict(X_ds, bias, Y_ds).compute()
+    [_, _, test_acc] = multiLogRegPredict(Xt, bias, Yt).compute()
+    print(train_acc  "  " test_acc)
+
+With this change the accuracy achieved changes from the previous value to 92%. This is still low on this dataset as can be seen on `MNIST <http://yann.lecun.com/exdb/mnist/>`_.
+But this is a basic implementation, that can be replaced by a variety of algorithm and techniques.

Review comment:
       But this is a basic implementation that can be replaced by a variety of algorithms and techniques.

##########
File path: src/main/python/docs/source/guide/algorithms.rst
##########
@@ -26,18 +26,173 @@ Prerequisite:
 
 - :doc:`/getting_started/install`
 
+This example goes through an algorithm from the list of builtin algorithms that can be applied to a dataset.
+For simplicity the dataset used for this is `MNIST <http://yann.lecun.com/exdb/mnist/>`_,
+since it is commonly known and explored.
+
+If one wants to skip the explanation then the full script is available at the bottom of this page.
 
 Step 1: Get Dataset
 -------------------
 
-TODO
+SystemDS provides builtin for downloading and setup of the MNIST dataset.
+To setup this simply use::
 
-Step 2: Train model
--------------------
+    from systemds.examples.tutorials.mnist import DataManager
+    d = DataManager()
+    X = d.get_train_data()
+    Y = d.get_train_labels()
+
+Here the DataManager contains the code for downloading and setting up numpy arrays containing the data.
+
+Step 2: Reshape & Format
+------------------------
+
+Usually data does not come in formats that perfectly fits the algorithms, to make this tutorial more
+realistic some data preprocessing is required to change the input to fit.
+
+First the Training data, X, has multiple dimensions resulting in a shape (60000, 28, 28).
+The dimensions correspond to first the number of images 60000, then the number of row pixels, 28,
+and finally the column pixels, 28.
+
+To use this data for Logistic Regression we have to reduce the dimensions.
+The input X is the training data. 
+It require the data to have two dimensions, the first resemble the
+number of inputs, and the other the number of features.
+
+Therefore to make the data fit the algorithm we reshape the X dataset, like so::
+
+    X = X.reshape((60000, 28*28))
+
+This takes each row of pixels and append to each other making a single feature vector per image.
+
+The Y dataset also does not perfectly fit the Logistic Regression algorithm, this is because the labels
+for this dataset is values ranging from 0, to 9, each label correspond to the integer shown in the image.
+unfortunately the algorithm require the labels to be distinct integers from 1 and upwards.
+
+Therefore we add 1 to each label such that the labels go from 1 to 10, like this::
+
+    Y = Y + 1
+
+With these steps we are now ready to train a simple logistic model.
+
+Step 3: Training
+----------------
+
+To start with, we setup a SystemDS context::
+
+    from systemds.context import SystemDSContext
+    sds = SystemDSContext()
+
+Then setup the data::
+
+    from systemds.matrix import Matrix
+    X_ds = Matrix(sds, X)
+    Y_ds = Matrix(sds, Y)
+
+to reduce the training time and verify everything works, it is usually good to reduce the amount of data,
+to train on a smaller sample to start with::
+
+    sample_size = 1000
+    X_ds = Matrix(sds, X[:sample_size])
+    Y_ds = Matrix(sds, Y[:sample_size])
+
+And now everything is ready for our algorithm::
+
+    from systemds.operator.algorithm import multiLogReg
+
+    bias = multiLogReg(X_ds, Y_ds)
+
+Note that nothing has been calculated yet, in SystemDS, since it only happens when you call compute::
+
+    bias_r = bias.compute()
 
-TODO
+bias is a matrix, that if matrix multiplied with an instance returns a value distribution where, the highest value is the predicted type.
+This is the matrix that could be saved and used for predicting labels later.
 
 Step 3: Validate
 ----------------
 
-TODO
+To see what accuracy the model achieves, we have to load in the test dataset as well.
+
+this can also be extracted from our builtin MNIST loader, to keep the tutorial short the operations are combined::
+
+    Xt = Matrix(sds, d.get_test_data().reshape((10000, 28*28)))
+    Yt = Matrix(sds, d.get_test_labels()) + 1
+
+The above loads the test data, and reshapes the X data the same way the training data was reshaped.
+
+Finally we verify the accuracy by calling::
+
+    from systemds.operator.algorithm import multiLogRegPredict
+    [m, y_pred, acc] = multiLogRegPredict(Xt, bias, Yt).compute()
+    print(acc)
+
+There are three outputs from the multiLogRegPredict call.
+
+m, is the mean probability of correctly classifying each label.
+y_pred, is the predictions made using the model, bias, trained.
+acc, is the accuracy achieved by the model.
+
+If the subset of the training data is used then you could expect an accuracy of 85% in this example
+using 1000 pictures of the training data.
+
+Step 4: Tuning
+--------------
+
+Now that we have a working baseline we can start tuning parameters.
+
+But first it is valuable to know how much of a difference in performance there is on the training data, vs the test data.
+This gives an indication of if we have exhausted the learning potential of the training data.
+
+To see how our accuracy is on the training data we use the Predict function again, but with our training data::
+
+    [m, y_pred, acc] = multiLogRegPredict(X_ds, bias, Y_ds).compute()
+    print(acc)
+
+In this specific case we achieve 100% accuracy on the training data, indicating that we have fit the training data,
+and have nothing more to learn from the data as it is now.
+
+To improve further we have to increase the training data, here for example we increase it
+from our sample of 1k to the full training dataset of 60k, in this example the maxi is set to reduce the number of iterations the algorithm takes,
+to again reduce training time::
+
+    X_ds = Matrix(sds, X)
+    Y_ds = Matrix(sds, Y)
+
+    bias = multiLogReg(X_ds, Y_ds, maxi=30)
+
+    [_, _, train_acc] = multiLogRegPredict(X_ds, bias, Y_ds).compute()
+    [_, _, test_acc] = multiLogRegPredict(Xt, bias, Yt).compute()
+    print(train_acc  "  " test_acc)

Review comment:
       This is invalid syntax. I think it should be: 
   `print(train_acc, "   ", test_acc)`

##########
File path: src/main/python/docs/source/guide/algorithms.rst
##########
@@ -26,18 +26,173 @@ Prerequisite:
 
 - :doc:`/getting_started/install`
 
+This example goes through an algorithm from the list of builtin algorithms that can be applied to a dataset.
+For simplicity the dataset used for this is `MNIST <http://yann.lecun.com/exdb/mnist/>`_,
+since it is commonly known and explored.
+
+If one wants to skip the explanation then the full script is available at the bottom of this page.
 
 Step 1: Get Dataset
 -------------------
 
-TODO
+SystemDS provides builtin for downloading and setup of the MNIST dataset.
+To setup this simply use::
 
-Step 2: Train model
--------------------
+    from systemds.examples.tutorials.mnist import DataManager
+    d = DataManager()
+    X = d.get_train_data()
+    Y = d.get_train_labels()
+
+Here the DataManager contains the code for downloading and setting up numpy arrays containing the data.
+
+Step 2: Reshape & Format
+------------------------
+
+Usually data does not come in formats that perfectly fits the algorithms, to make this tutorial more
+realistic some data preprocessing is required to change the input to fit.
+
+First the Training data, X, has multiple dimensions resulting in a shape (60000, 28, 28).
+The dimensions correspond to first the number of images 60000, then the number of row pixels, 28,
+and finally the column pixels, 28.
+
+To use this data for Logistic Regression we have to reduce the dimensions.
+The input X is the training data. 
+It require the data to have two dimensions, the first resemble the
+number of inputs, and the other the number of features.
+
+Therefore to make the data fit the algorithm we reshape the X dataset, like so::
+
+    X = X.reshape((60000, 28*28))
+
+This takes each row of pixels and append to each other making a single feature vector per image.
+
+The Y dataset also does not perfectly fit the Logistic Regression algorithm, this is because the labels
+for this dataset is values ranging from 0, to 9, each label correspond to the integer shown in the image.
+unfortunately the algorithm require the labels to be distinct integers from 1 and upwards.
+
+Therefore we add 1 to each label such that the labels go from 1 to 10, like this::
+
+    Y = Y + 1
+
+With these steps we are now ready to train a simple logistic model.
+
+Step 3: Training
+----------------
+
+To start with, we setup a SystemDS context::
+
+    from systemds.context import SystemDSContext
+    sds = SystemDSContext()
+
+Then setup the data::
+
+    from systemds.matrix import Matrix
+    X_ds = Matrix(sds, X)
+    Y_ds = Matrix(sds, Y)
+
+to reduce the training time and verify everything works, it is usually good to reduce the amount of data,
+to train on a smaller sample to start with::
+
+    sample_size = 1000
+    X_ds = Matrix(sds, X[:sample_size])
+    Y_ds = Matrix(sds, Y[:sample_size])
+
+And now everything is ready for our algorithm::
+
+    from systemds.operator.algorithm import multiLogReg
+
+    bias = multiLogReg(X_ds, Y_ds)
+
+Note that nothing has been calculated yet, in SystemDS, since it only happens when you call compute::
+
+    bias_r = bias.compute()
 
-TODO
+bias is a matrix, that if matrix multiplied with an instance returns a value distribution where, the highest value is the predicted type.
+This is the matrix that could be saved and used for predicting labels later.
 
 Step 3: Validate
 ----------------
 
-TODO
+To see what accuracy the model achieves, we have to load in the test dataset as well.
+
+this can also be extracted from our builtin MNIST loader, to keep the tutorial short the operations are combined::
+
+    Xt = Matrix(sds, d.get_test_data().reshape((10000, 28*28)))
+    Yt = Matrix(sds, d.get_test_labels()) + 1
+
+The above loads the test data, and reshapes the X data the same way the training data was reshaped.
+
+Finally we verify the accuracy by calling::
+
+    from systemds.operator.algorithm import multiLogRegPredict
+    [m, y_pred, acc] = multiLogRegPredict(Xt, bias, Yt).compute()
+    print(acc)
+
+There are three outputs from the multiLogRegPredict call.
+
+m, is the mean probability of correctly classifying each label.
+y_pred, is the predictions made using the model, bias, trained.
+acc, is the accuracy achieved by the model.

Review comment:
       This is displayed as one line when opening in IntelliJ. Is there an option to make it appear as an actual list?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] sebwrede commented on pull request #1061: [SYSTEMDS-2669] Python Mnist LogReg Tutorial

Posted by GitBox <gi...@apache.org>.
sebwrede commented on pull request #1061:
URL: https://github.com/apache/systemds/pull/1061#issuecomment-697231659


   The tests are failing, @Baunsgaard.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org