Posted to dev@systemds.apache.org by GitBox <gi...@apache.org> on 2021/06/23 13:19:31 UTC

[GitHub] [systemds] Gandagorn opened a new pull request #1323: [WIP][SYSTEMDS-2835] Python end-to-end tutorial

Gandagorn opened a new pull request #1323:
URL: https://github.com/apache/systemds/pull/1323


   In this PR we implement a Python end-to-end tutorial for a standard ML classification task on the Adult dataset.
   The preprocessing of the data includes:
   - one-hot encoding
   - missing value imputation
   - outlier removal
   - scaling
   
   Classification is done using logistic regression and a neural network.
   The results on the test set are evaluated using various metrics and a confusion matrix.
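   A rough sketch of the kind of flow the tutorial covers, using the same Python API calls as the accompanying tests (the arguments are illustrative, not the final tutorial code):
   
       from systemds.context import SystemDSContext
       from systemds.examples.tutorials.adult import DataManager
       from systemds.operator.algorithm import multiLogReg, multiLogRegPredict, confusionMatrix
       
       sds = SystemDSContext()
       d = DataManager()
       # preprocessed Adult data: one-hot encoded, imputed, outliers removed, scaled
       train_data, train_labels, test_data, test_labels = d.get_preprocessed_dataset()
       
       X  = sds.from_numpy(train_data)
       Y  = sds.from_numpy(train_labels) + 1.0   # labels are expected to be 1-based
       Xt = sds.from_numpy(test_data)
       Yt = sds.from_numpy(test_labels) + 1.0
       
       betas = multiLogReg(X, Y)                                        # train logistic regression
       [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt).compute()   # predict on the test set
       cm, _ = confusionMatrix(sds.from_numpy(y_pred), Yt).compute()    # evaluate with a confusion matrix
       sds.close()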





[GitHub] [systemds] Gandagorn commented on a change in pull request #1323: [WIP][SYSTEMDS-2835] Python end-to-end tutorial

Posted by GitBox <gi...@apache.org>.
Gandagorn commented on a change in pull request #1323:
URL: https://github.com/apache/systemds/pull/1323#discussion_r663786508



##########
File path: src/main/python/tests/examples/tutorials/test_adult.py
##########
@@ -387,6 +387,11 @@ def test_level2(self):
         ################################################################################################################
         X1, M1 = X1.transform_encode(spec=jspec)
 
+        # better alternative for encoding
+        # X1, M = F1.transform_encode(spec=jspec)
+        # X2 = F2.transform_apply(spec=jspec, meta=M)
+        # testX2 = X2.compute(True)

Review comment:
       @Baunsgaard
   Hi, we tried to implement a better version of the encoding using transform_apply, because otherwise we would calculate statistics for imputation on the whole dataset instead of just the training data.
   Unfortunately, we ran into the problem that the labels in the train data differ slightly from the labels in the test data ("<=50K" != "<=50K."), which prevents us from using the encoding metadata M for the test data. We tried different methods for correcting the labels in the test data before encoding, but the main problem is that we have not found a good way to apply changes to a column of a frame, and we are also unable to pass frames as arguments to a DML script function (it does not seem to be supported with Python yet?).
   The simplest solution would be to remove the "." at the end of the test labels in the file itself. Other possible workarounds that use only SystemDS functions seem to be quite complex and would probably miss the main goal of this tutorial.
   How should we proceed?
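   For reference, a minimal sketch of the encoding we were aiming for (the same lines that are commented out above); it currently fails because the "<=50K." spelling in the test frame is not covered by the metadata M built from the training frame:
   
       # encode on the training frame only, so imputation statistics come from train data
       X1, M = F1.transform_encode(spec=jspec)
       # re-use the training metadata on the test frame; this is where the mismatching
       # label strings ("<=50K" vs "<=50K.") currently break the recoding
       X2 = F2.transform_apply(spec=jspec, meta=M)
       testX2 = X2.compute(True)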







[GitHub] [systemds] Gandagorn commented on a change in pull request #1323: [WIP][SYSTEMDS-2835] Python end-to-end tutorial

Posted by GitBox <gi...@apache.org>.
Gandagorn commented on a change in pull request #1323:
URL: https://github.com/apache/systemds/pull/1323#discussion_r679146344



##########
File path: src/main/python/tests/examples/tutorials/test_adult.py
##########
@@ -0,0 +1,324 @@
+# -------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+# -------------------------------------------------------------
+import os
+import unittest
+
+import numpy as np
+from systemds.context import SystemDSContext
+from systemds.examples.tutorials.adult import DataManager
+from systemds.operator import OperationNode, Matrix, Frame
+from systemds.operator.algorithm import kmeans, multiLogReg, multiLogRegPredict, l2svm, confusionMatrix, scale, scaleApply, split, winsorize
+from systemds.script_building import DMLScript
+
+
+class Test_DMLScript(unittest.TestCase):
+    """
+    Test class for adult dml script tutorial code.
+    """
+
+    sds: SystemDSContext = None
+    d: DataManager = None
+    neural_net_src_path: str = "tests/examples/tutorials/neural_net_source.dml"
+    preprocess_src_path: str = "tests/examples/tutorials/preprocess.dml"
+    dataset_path_train: str = "../../test/resources/datasets/adult/train_data.csv"
+    dataset_path_train_mtd: str = "../../test/resources/datasets/adult/train_data.csv.mtd"
+    dataset_path_test: str = "../../test/resources/datasets/adult/test_data.csv"
+    dataset_path_test_mtd: str = "../../test/resources/datasets/adult/test_data.csv.mtd"
+    dataset_jspec: str = "../../test/resources/datasets/adult/jspec.json"
+
+    @classmethod
+    def setUpClass(cls):
+        cls.sds = SystemDSContext()
+        cls.d = DataManager()
+
+    @classmethod
+    def tearDownClass(cls):
+        cls.sds.close()
+
+    def test_train_data(self):
+        x = self.d.get_train_data()
+        self.assertEqual((32561, 14), x.shape)
+
+    def test_train_labels(self):
+        y = self.d.get_train_labels()
+        self.assertEqual((32561,), y.shape)
+
+    def test_test_data(self):
+        x_l = self.d.get_test_data()
+        self.assertEqual((16281, 14), x_l.shape)
+
+    def test_test_labels(self):
+        y_l = self.d.get_test_labels()
+        self.assertEqual((16281,), y_l.shape)
+
+    def test_preprocess(self):
+        #assumes certain preprocessing
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset()
+        self.assertEqual((30162,104), train_data.shape)
+        self.assertEqual((30162, ), train_labels.shape)
+        self.assertEqual((15060,104), test_data.shape)
+        self.assertEqual((15060, ), test_labels.shape)
+
+    def test_multi_log_reg(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset()
+
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+        Y = Y + 1.0
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+        Yt = Yt + 1.0
+
+        betas = multiLogReg(X, Y)
+
+        [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt).compute()
+
+        self.assertGreater(acc, 80)
+
+        confusion_matrix_abs, _ = confusionMatrix(self.sds.from_numpy(y_pred), Yt).compute()
+
+        self.assertTrue(
+            np.allclose(
+                confusion_matrix_abs,
+                np.array([[3503, 503],
+                          [268, 726]])
+            )
+        )
+
+    def test_neural_net(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+
+        FFN_package = self.sds.source(self.neural_net_src_path, "fnn", print_imported_methods=True)
+
+        network = FFN_package.train(X, Y, 1, 16, 0.01, 1)
+
+        self.assertTrue(type(network) is not None) # sourcing and training seems to works
+
+        FFN_package.save_model(network, '"model/python_FFN/"').compute(verbose=True)
+
+        # TODO This does not work yet, not sure what the problem is
+        #probs = FFN_package.predict(Xt, network).compute(True)
+        # FFN_package.eval(Yt, Yt).compute()
+
+
+
+    def test_level1(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset(interpolate=True,
+                                                                                           standardize=True, dimred=0.1)
+        # Train data
+        X = self.sds.from_numpy(train_data[:train_count])
+        Y = self.sds.from_numpy(train_labels[:train_count])
+        Y = Y + 1.0
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+        Yt = Yt + 1.0
+
+        betas = multiLogReg(X, Y)
+
+        [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt).compute()
+        self.assertGreater(acc, 80) #Todo remove?
+        # todo add text how high acc should be with this config
+
+        confusion_matrix_abs, _ = confusionMatrix(self.sds.from_numpy(y_pred), Yt).compute()
+        # todo print confusion matrix? Explain cm?
+        self.assertTrue(
+            np.allclose(
+                confusion_matrix_abs,
+                np.array([[3583, 502],
+                          [245, 670]])
+            )
+        )
+
+    def test_level2(self):
+
+        train_count = 32561
+        test_count = 16281
+
+        SCHEMA = '"DOUBLE,STRING,DOUBLE,STRING,DOUBLE,STRING,STRING,STRING,STRING,STRING,DOUBLE,DOUBLE,DOUBLE,STRING,STRING"'
+
+        F1 = self.sds.read(
+            self.dataset_path_train,
+            schema=SCHEMA
+        )
+        F2 = self.sds.read(
+            self.dataset_path_test,
+            schema=SCHEMA
+        )
+
+        jspec = self.sds.read(self.dataset_jspec, data_type="scalar", value_type="string")
+        PREPROCESS_package = self.sds.source(self.preprocess_src_path, "preprocess", print_imported_methods=True)
+
+        X1 = F1.rbind(F2)
+        X1, M1 = X1.transform_encode(spec=jspec)
+
+        X = PREPROCESS_package.get_X(X1, 1, train_count)
+        Y = PREPROCESS_package.get_Y(X1, 1, train_count)
+
+        Xt = PREPROCESS_package.get_X(X1, train_count, train_count+test_count)
+        Yt = PREPROCESS_package.get_Y(X1, train_count, train_count+test_count)
+
+        Yt = PREPROCESS_package.replace_value(Yt, 3.0, 1.0)
+        Yt = PREPROCESS_package.replace_value(Yt, 4.0, 2.0)
+
+        # better alternative for encoding. This was intended, but it does not work

Review comment:
       Yes. The replace function you made for us to clean the target labels works correctly for one label, but not for the other (see our discussion that started on July 5).
   The commented-out code is how it should be done, using the replace function and the transform_apply function. The uncommented code is a temporary, working solution, but it is not optimal.
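   As a sketch, the intended version would look roughly like this once replace handles both label variants (assuming the frame replace method takes pattern and replacement; the second replacement is the case that still fails for us, so treat this as an illustration rather than working code):
   
       # normalize the test labels so they match the training labels before encoding
       F2c = F2.replace("<=50K.", "<=50K")
       F2c = F2c.replace(">50K.", ">50K")    # this second variant is the one that still fails
       X1, M = F1.transform_encode(spec=jspec)
       X2 = F2c.transform_apply(spec=jspec, meta=M)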







[GitHub] [systemds] j143 commented on pull request #1323: [WIP][SYSTEMDS-2835] Python end-to-end tutorial

Posted by GitBox <gi...@apache.org>.
j143 commented on pull request #1323:
URL: https://github.com/apache/systemds/pull/1323#issuecomment-869188396


   @Baunsgaard - Is it possible to keep the datasets on the website instead of committing them to this repo?





[GitHub] [systemds] Baunsgaard commented on a change in pull request #1323: [WIP][SYSTEMDS-2835] Python end-to-end tutorial

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on a change in pull request #1323:
URL: https://github.com/apache/systemds/pull/1323#discussion_r663369719



##########
File path: src/main/python/tests/examples/tutorials/test_adult.py
##########
@@ -0,0 +1,243 @@
+# -------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+# -------------------------------------------------------------
+import os
+import unittest
+
+import numpy as np
+from systemds.context import SystemDSContext
+from systemds.examples.tutorials.adult import DataManager
+from systemds.operator import OperationNode
+from systemds.operator.algorithm import kmeans, multiLogReg, multiLogRegPredict, l2svm, confusionMatrix, scale, scaleApply, split, winsorize
+from systemds.script_building import DMLScript
+
+
+class Test_DMLScript(unittest.TestCase):
+    """
+    Test class for adult dml script tutorial code.
+    """
+
+    sds: SystemDSContext = None
+    d: DataManager = None
+    neural_net_src_path: str = "tests/examples/tutorials/neural_net_source.dml"
+    dataset_path_train: str = "../../test/resources/datasets/adult/train_data.csv"
+    dataset_path_train_mtd: str = "../../test/resources/datasets/adult/train_data.csv.mtd"
+    dataset_path_test: str = "../../test/resources/datasets/adult/test_data.csv"
+    dataset_path_test_mtd: str = "../../test/resources/datasets/adult/test_data.csv.mtd"
+    dataset_jspec: str = "../../test/resources/datasets/adult/jspec.json"
+
+    @classmethod
+    def setUpClass(cls):
+        cls.sds = SystemDSContext()
+        cls.d = DataManager()
+
+    @classmethod
+    def tearDownClass(cls):
+        cls.sds.close()
+
+    def test_train_data(self):
+        x = self.d.get_train_data()
+        self.assertEqual((32561, 14), x.shape)
+
+    def test_train_labels(self):
+        y = self.d.get_train_labels()
+        self.assertEqual((32561,), y.shape)
+
+    def test_test_data(self):
+        x_l = self.d.get_test_data()
+        self.assertEqual((16281, 14), x_l.shape)
+
+    def test_test_labels(self):
+        y_l = self.d.get_test_labels()
+        self.assertEqual((16281,), y_l.shape)
+
+    def test_preprocess(self):
+        #assumes certain preprocessing
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset()
+        self.assertEqual((30162,104), train_data.shape)
+        self.assertEqual((30162, ), train_labels.shape)
+        self.assertEqual((15060,104), test_data.shape)
+        self.assertEqual((15060, ), test_labels.shape)
+
+    def test_multi_log_reg(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset()
+
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+        Y = Y + 1.0
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+        Yt = Yt + 1.0
+
+        betas = multiLogReg(X, Y)
+
+        [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt).compute()
+
+        self.assertGreater(acc, 80)
+
+        confusion_matrix_abs, _ = confusionMatrix(self.sds.from_numpy(y_pred), Yt).compute()
+
+        self.assertTrue(
+            np.allclose(
+                confusion_matrix_abs,
+                np.array([[3503, 503],
+                          [268, 726]])
+            )
+        )
+
+
+    def test_multi_log_reg_interpolated_standardized(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+        Y = Y + 1.0
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+        Yt = Yt + 1.0
+
+        betas = multiLogReg(X, Y)
+
+        [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt).compute()
+
+        self.assertGreater(acc, 80)
+        
+        confusion_matrix_abs, _ = confusionMatrix(self.sds.from_numpy(y_pred), Yt).compute()
+
+        self.assertTrue(
+            np.allclose(
+                confusion_matrix_abs,
+                np.array([[3583,  502],
+                         [245,  670]])
+            )
+        )
+
+
+    def test_neural_net(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+
+        FFN_package = self.sds.source(self.neural_net_src_path, "fnn", print_imported_methods=True)
+
+        network = FFN_package.train(X, Y, 1, 16, 0.01, 1)
+
+        self.assertTrue(type(network) is not None) # sourcing and training seems to works
+
+        FFN_package.save_model(network, '"model/python_FFN/"').compute(verbose=True)
+
+        # TODO This does not work yet, not sure what the problem is
+        #probs = FFN_package.predict(Xt, network).compute(True)
+        # FFN_package.eval(Yt, Yt).compute()

Review comment:
       Not yet.
   I did not run it, will do next week.







[GitHub] [systemds] Baunsgaard commented on a change in pull request #1323: [WIP][SYSTEMDS-2835] Python end-to-end tutorial

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on a change in pull request #1323:
URL: https://github.com/apache/systemds/pull/1323#discussion_r667795868



##########
File path: src/main/python/tests/examples/tutorials/test_adult.py
##########
@@ -386,51 +386,35 @@ def test_level2(self):
 
         """""
         ################################################################################################################
-        X1, M1 = X1.transform_encode(spec=jspec).compute()
+        X1, M1 = X1.transform_encode(spec=jspec)
 
         ################################################################################################################
         """"
-        First we re-split out data into a training and a test set with the corresponding labels. We can then simply transform
-        the numpy array of the training data back to SystemDS matrix by using "sds.from_numpy()". 
-        The SystemDS scale function takes a matrix as an input and returns three output parameters:
-            # Y            Matrix    ---      Output feature matrix with K columns
-            # ColMean      Matrix    ---      The column means of the input, subtracted if Center was TRUE
-            # ScaleFactor  Matrix    ---      The Scaling of the values, to make each dimension have similar value ranges
-        If we want to retransform a SystemDs Matrix to a Numpy array we can do so by using the np.array() function. 
+        First we re-split out data into a training and a test set with the corresponding labels. 
         """""
         ################################################################################################################
-        col_length = len(X1[0])
-        X = X1[0:train_count, 0:col_length - 1]
-        Y = X1[0:train_count, col_length - 1:col_length].flatten()
-        # Test data
-        Xt = X1[train_count:train_count + test_count, 0:col_length - 1]
-        Yt = X1[train_count:train_count + test_count, col_length - 1:col_length].flatten()
+        PREPROCESS_package = self.sds.source(self.preprocess_src_path, "preprocess", print_imported_methods=True)
 
+        X = PREPROCESS_package.get_X(X1, train_count)
+        Y = PREPROCESS_package.get_Y(X1, train_count)
+        #We lose the column count information after using the Preprocess Package. This triggers an error on multilogregpredict. Otherwise its working
+        Xt = self.sds.from_numpy(np.array(PREPROCESS_package.get_Xt(X1, train_count).compute()))

Review comment:
       If I understand correctly:
   
   after you turn anything into a matrix inside the system, like X and Y, you no longer know how many columns and rows it has.
   This is intended, since materializing the column and row counts in the Python API would require processing that we only evaluate on compute.
   Once you have the result back from compute, you get the correct number of columns and rows in NumPy, but for the intermediates the shape is not known.
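   As a small illustration (a sketch using the variable names from the test above):
   
       Xt_node = PREPROCESS_package.get_Xt(X1, train_count)   # lazy node, shape not known yet
       Xt_np = np.array(Xt_node.compute())                    # after compute, numpy knows rows/cols
       Xt = self.sds.from_numpy(Xt_np)                        # shape information is available again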

##########
File path: src/main/python/tests/examples/tutorials/test_adult.py
##########
@@ -387,6 +387,11 @@ def test_level2(self):
         ################################################################################################################
         X1, M1 = X1.transform_encode(spec=jspec)
 
+        # better alternative for encoding
+        # X1, M = F1.transform_encode(spec=jspec)
+        # X2 = F2.transform_apply(spec=jspec, meta=M)
+        # testX2 = X2.compute(True)

Review comment:
       Two things here.
   
   1. Yes, if there is a different number of labels in the training vs. the test data, then transform_apply does not work; currently there is no way around this. Do you really have 50k different classes in one of the features? I think you might be using the wrong encoding scheme for some of the columns.
   2. Frames should be supported, but they are very new, so there are bound to be bugs. The function definitions should specify frame if the input type is frame; otherwise you should not call a function with frames. Could you tell me which function you are trying to call? Then I can try to fix it if there is a bug.

##########
File path: src/main/python/tests/examples/tutorials/test_adult.py
##########
@@ -387,6 +387,11 @@ def test_level2(self):
         ################################################################################################################
         X1, M1 = X1.transform_encode(spec=jspec)
 
+        # better alternative for encoding
+        # X1, M = F1.transform_encode(spec=jspec)
+        # X2 = F2.transform_apply(spec=jspec, meta=M)
+        # testX2 = X2.compute(True)

Review comment:
       1. Okay, now I understand the problem... a classic... "someone made an error when making the dataset"...
   2. Since you have this issue, I just extended frames to support the replace operation. Simply use:
   
   replace(target=X, pattern="<=50K.", replacement="<=50K")
   

##########
File path: src/main/python/tests/examples/tutorials/test_adult.py
##########
@@ -387,6 +387,11 @@ def test_level2(self):
         ################################################################################################################
         X1, M1 = X1.transform_encode(spec=jspec)
 
+        # better alternative for encoding
+        # X1, M = F1.transform_encode(spec=jspec)
+        # X2 = F2.transform_apply(spec=jspec, meta=M)
+        # testX2 = X2.compute(True)

Review comment:
       Well, I guess I did not add it to the Python API... will do.
   

##########
File path: src/main/python/tests/examples/tutorials/test_adult.py
##########
@@ -387,6 +387,11 @@ def test_level2(self):
         ################################################################################################################
         X1, M1 = X1.transform_encode(spec=jspec)
 
+        # better alternative for encoding
+        # X1, M = F1.transform_encode(spec=jspec)
+        # X2 = F2.transform_apply(spec=jspec, meta=M)
+        # testX2 = X2.compute(True)

Review comment:
       Should be there now... if you have a matrix or a frame, simply call .replace(pattern, replacement) on it.
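   A hedged usage sketch for the test labels (assuming the Python method mirrors the DML call above and takes the pattern first and the replacement second):
   
       # strip the trailing "." from the test labels so they match the training metadata
       F2 = F2.replace("<=50K.", "<=50K")
       X2 = F2.transform_apply(spec=jspec, meta=M)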







[GitHub] [systemds] martinhofwe edited a comment on pull request #1323: [WIP][SYSTEMDS-2835] Python end-to-end tutorial

Posted by GitBox <gi...@apache.org>.
martinhofwe edited a comment on pull request #1323:
URL: https://github.com/apache/systemds/pull/1323#issuecomment-875812011


   Hi @Baunsgaard,
   we added the level 1 and level 2 rst code. Should we split the different levels into different files, or should they all be in one file? If they are all in the same file, should we only comment on the new code bits in each progressive level, or should every level be self-explanatory with its comments?
   BR





[GitHub] [systemds] Baunsgaard commented on a change in pull request #1323: [WIP][SYSTEMDS-2835] Python end-to-end tutorial

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on a change in pull request #1323:
URL: https://github.com/apache/systemds/pull/1323#discussion_r658585499



##########
File path: src/main/python/tests/examples/tutorials/test_adult.py
##########
@@ -0,0 +1,243 @@
+# -------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+# -------------------------------------------------------------
+import os
+import unittest
+
+import numpy as np
+from systemds.context import SystemDSContext
+from systemds.examples.tutorials.adult import DataManager
+from systemds.operator import OperationNode
+from systemds.operator.algorithm import kmeans, multiLogReg, multiLogRegPredict, l2svm, confusionMatrix, scale, scaleApply, split, winsorize
+from systemds.script_building import DMLScript
+
+
+class Test_DMLScript(unittest.TestCase):
+    """
+    Test class for adult dml script tutorial code.
+    """
+
+    sds: SystemDSContext = None
+    d: DataManager = None
+    neural_net_src_path: str = "tests/examples/tutorials/neural_net_source.dml"
+    dataset_path_train: str = "../../test/resources/datasets/adult/train_data.csv"
+    dataset_path_train_mtd: str = "../../test/resources/datasets/adult/train_data.csv.mtd"
+    dataset_path_test: str = "../../test/resources/datasets/adult/test_data.csv"
+    dataset_path_test_mtd: str = "../../test/resources/datasets/adult/test_data.csv.mtd"
+    dataset_jspec: str = "../../test/resources/datasets/adult/jspec.json"
+
+    @classmethod
+    def setUpClass(cls):
+        cls.sds = SystemDSContext()
+        cls.d = DataManager()
+
+    @classmethod
+    def tearDownClass(cls):
+        cls.sds.close()
+
+    def test_train_data(self):
+        x = self.d.get_train_data()
+        self.assertEqual((32561, 14), x.shape)
+
+    def test_train_labels(self):
+        y = self.d.get_train_labels()
+        self.assertEqual((32561,), y.shape)
+
+    def test_test_data(self):
+        x_l = self.d.get_test_data()
+        self.assertEqual((16281, 14), x_l.shape)
+
+    def test_test_labels(self):
+        y_l = self.d.get_test_labels()
+        self.assertEqual((16281,), y_l.shape)
+
+    def test_preprocess(self):
+        #assumes certain preprocessing
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset()
+        self.assertEqual((30162,104), train_data.shape)
+        self.assertEqual((30162, ), train_labels.shape)
+        self.assertEqual((15060,104), test_data.shape)
+        self.assertEqual((15060, ), test_labels.shape)
+
+    def test_multi_log_reg(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset()
+
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+        Y = Y + 1.0
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+        Yt = Yt + 1.0
+
+        betas = multiLogReg(X, Y)
+
+        [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt).compute()
+
+        self.assertGreater(acc, 80)
+
+        confusion_matrix_abs, _ = confusionMatrix(self.sds.from_numpy(y_pred), Yt).compute()
+
+        self.assertTrue(
+            np.allclose(
+                confusion_matrix_abs,
+                np.array([[3503, 503],
+                          [268, 726]])
+            )
+        )
+
+
+    def test_multi_log_reg_interpolated_standardized(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+        Y = Y + 1.0
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+        Yt = Yt + 1.0
+
+        betas = multiLogReg(X, Y)
+
+        [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt).compute()
+
+        self.assertGreater(acc, 80)
+        
+        confusion_matrix_abs, _ = confusionMatrix(self.sds.from_numpy(y_pred), Yt).compute()
+
+        self.assertTrue(
+            np.allclose(
+                confusion_matrix_abs,
+                np.array([[3583,  502],
+                         [245,  670]])
+            )
+        )
+
+
+    def test_neural_net(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+
+        FFN_package = self.sds.source(self.neural_net_src_path, "fnn", print_imported_methods=True)
+
+        network = FFN_package.train(X, Y, 1, 16, 0.01, 1)
+
+        self.assertTrue(type(network) is not None) # sourcing and training seems to works
+
+        FFN_package.save_model(network, '"model/python_FFN/"').compute(verbose=True)
+
+        # TODO This does not work yet, not sure what the problem is
+        #probs = FFN_package.predict(Xt, network).compute(True)
+        # FFN_package.eval(Yt, Yt).compute()
+
+
+    def test_parse_dataset_with_systemdsread(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 30000
+        test_count = 10000
+        #self.sds.read(self.dataset_path_train, schema=self.dataset_path_train_mtd).compute(verbose=True)
+        print("")
+
+        SCHEMA = '"DOUBLE,STRING,DOUBLE,STRING,DOUBLE,STRING,STRING,STRING,STRING,STRING,DOUBLE,DOUBLE,DOUBLE,STRING,STRING"'
+        F1 = self.sds.read(
+            self.dataset_path_train,
+            schema=SCHEMA
+        )
+        F2 = self.sds.read(
+            self.dataset_path_test,
+            schema=SCHEMA
+        )
+        jspec = self.sds.read(self.dataset_jspec, data_type="scalar", value_type="string")
+        #scaling does not have effect yet. We need to replace labels in test set with the same string as in train dataset
+        X1, M1 = F1.rbind(F2).transform_encode(spec=jspec).compute()

Review comment:
       You should not need to compute here.
   Use the output X1 directly and call our builtin split.
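   A sketch of the lazy version (the exact signature of the split builtin is an assumption here):
   
       # keep X1 as a SystemDS matrix instead of computing it to numpy at this point
       X1, M1 = F1.rbind(F2).transform_encode(spec=jspec)
       # the builtin split imported in this test can then partition the data without a
       # compute() round trip, e.g. (argument names assumed):
       # X, Xt, Y, Yt = split(features, labels, f=0.75, seed=7)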

##########
File path: src/main/python/tests/examples/tutorials/test_adult.py
##########
@@ -0,0 +1,243 @@
+# -------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+# -------------------------------------------------------------
+import os
+import unittest
+
+import numpy as np
+from systemds.context import SystemDSContext
+from systemds.examples.tutorials.adult import DataManager
+from systemds.operator import OperationNode
+from systemds.operator.algorithm import kmeans, multiLogReg, multiLogRegPredict, l2svm, confusionMatrix, scale, scaleApply, split, winsorize
+from systemds.script_building import DMLScript
+
+
+class Test_DMLScript(unittest.TestCase):
+    """
+    Test class for adult dml script tutorial code.
+    """
+
+    sds: SystemDSContext = None
+    d: DataManager = None
+    neural_net_src_path: str = "tests/examples/tutorials/neural_net_source.dml"
+    dataset_path_train: str = "../../test/resources/datasets/adult/train_data.csv"
+    dataset_path_train_mtd: str = "../../test/resources/datasets/adult/train_data.csv.mtd"
+    dataset_path_test: str = "../../test/resources/datasets/adult/test_data.csv"
+    dataset_path_test_mtd: str = "../../test/resources/datasets/adult/test_data.csv.mtd"
+    dataset_jspec: str = "../../test/resources/datasets/adult/jspec.json"
+
+    @classmethod
+    def setUpClass(cls):
+        cls.sds = SystemDSContext()
+        cls.d = DataManager()
+
+    @classmethod
+    def tearDownClass(cls):
+        cls.sds.close()
+
+    def test_train_data(self):
+        x = self.d.get_train_data()
+        self.assertEqual((32561, 14), x.shape)
+
+    def test_train_labels(self):
+        y = self.d.get_train_labels()
+        self.assertEqual((32561,), y.shape)
+
+    def test_test_data(self):
+        x_l = self.d.get_test_data()
+        self.assertEqual((16281, 14), x_l.shape)
+
+    def test_test_labels(self):
+        y_l = self.d.get_test_labels()
+        self.assertEqual((16281,), y_l.shape)
+
+    def test_preprocess(self):
+        #assumes certain preprocessing
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset()
+        self.assertEqual((30162,104), train_data.shape)
+        self.assertEqual((30162, ), train_labels.shape)
+        self.assertEqual((15060,104), test_data.shape)
+        self.assertEqual((15060, ), test_labels.shape)
+
+    def test_multi_log_reg(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset()
+
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+        Y = Y + 1.0
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+        Yt = Yt + 1.0
+
+        betas = multiLogReg(X, Y)
+
+        [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt).compute()
+
+        self.assertGreater(acc, 80)
+
+        confusion_matrix_abs, _ = confusionMatrix(self.sds.from_numpy(y_pred), Yt).compute()
+
+        self.assertTrue(
+            np.allclose(
+                confusion_matrix_abs,
+                np.array([[3503, 503],
+                          [268, 726]])
+            )
+        )
+
+
+    def test_multi_log_reg_interpolated_standardized(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+        Y = Y + 1.0
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+        Yt = Yt + 1.0
+
+        betas = multiLogReg(X, Y)
+
+        [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt).compute()
+
+        self.assertGreater(acc, 80)
+        
+        confusion_matrix_abs, _ = confusionMatrix(self.sds.from_numpy(y_pred), Yt).compute()
+
+        self.assertTrue(
+            np.allclose(
+                confusion_matrix_abs,
+                np.array([[3583,  502],
+                         [245,  670]])
+            )
+        )
+
+
+    def test_neural_net(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+
+        FFN_package = self.sds.source(self.neural_net_src_path, "fnn", print_imported_methods=True)
+
+        network = FFN_package.train(X, Y, 1, 16, 0.01, 1)
+
+        self.assertTrue(type(network) is not None) # sourcing and training seems to works
+
+        FFN_package.save_model(network, '"model/python_FFN/"').compute(verbose=True)
+
+        # TODO This does not work yet, not sure what the problem is
+        #probs = FFN_package.predict(Xt, network).compute(True)
+        # FFN_package.eval(Yt, Yt).compute()
+
+
+    def test_parse_dataset_with_systemdsread(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 30000
+        test_count = 10000
+        #self.sds.read(self.dataset_path_train, schema=self.dataset_path_train_mtd).compute(verbose=True)
+        print("")
+
+        SCHEMA = '"DOUBLE,STRING,DOUBLE,STRING,DOUBLE,STRING,STRING,STRING,STRING,STRING,DOUBLE,DOUBLE,DOUBLE,STRING,STRING"'
+        F1 = self.sds.read(
+            self.dataset_path_train,
+            schema=SCHEMA
+        )
+        F2 = self.sds.read(
+            self.dataset_path_test,
+            schema=SCHEMA
+        )
+        jspec = self.sds.read(self.dataset_jspec, data_type="scalar", value_type="string")
+        #scaling does not have effect yet. We need to replace labels in test set with the same string as in train dataset
+        X1, M1 = F1.rbind(F2).transform_encode(spec=jspec).compute()
+        col_length = len(X1[0])
+        X = X1[0:train_count, 0:col_length-1]
+        Y = X1[0:train_count, col_length-1:col_length].flatten()-1
+        # Test data
+        Xt = X1[train_count:train_count+test_count, 0:col_length-1]
+        Yt = X1[train_count:train_count+test_count, col_length-1:col_length].flatten()-1
+
+        _ , mean , sigma = scale(self.sds.from_numpy(X), True, True).compute()

Review comment:
       Here we also should not need to compute; simply take the outputs and call the apply function on the test data.
   
   The first result of this call is already the scaled and shifted X, so the second scaling of X that follows is not needed.
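   A sketch of that flow (how the three outputs of scale are unpacked without compute() is an assumption here):
   
       # scale once on the training data, staying lazy
       X_scaled, col_mean, scale_factor = scale(self.sds.from_numpy(X), True, True)
       # X_scaled is already centered and scaled, so no second scaling of X is needed;
       # re-use the training statistics on the test data
       Xt_scaled = scaleApply(self.sds.from_numpy(Xt), col_mean, scale_factor)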

##########
File path: src/main/python/tests/examples/tutorials/test_adult.py
##########
@@ -0,0 +1,243 @@
+# -------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+# -------------------------------------------------------------
+import os
+import unittest
+
+import numpy as np
+from systemds.context import SystemDSContext
+from systemds.examples.tutorials.adult import DataManager
+from systemds.operator import OperationNode
+from systemds.operator.algorithm import kmeans, multiLogReg, multiLogRegPredict, l2svm, confusionMatrix, scale, scaleApply, split, winsorize
+from systemds.script_building import DMLScript
+
+
+class Test_DMLScript(unittest.TestCase):
+    """
+    Test class for adult dml script tutorial code.
+    """
+
+    sds: SystemDSContext = None
+    d: DataManager = None
+    neural_net_src_path: str = "tests/examples/tutorials/neural_net_source.dml"
+    dataset_path_train: str = "../../test/resources/datasets/adult/train_data.csv"
+    dataset_path_train_mtd: str = "../../test/resources/datasets/adult/train_data.csv.mtd"
+    dataset_path_test: str = "../../test/resources/datasets/adult/test_data.csv"
+    dataset_path_test_mtd: str = "../../test/resources/datasets/adult/test_data.csv.mtd"
+    dataset_jspec: str = "../../test/resources/datasets/adult/jspec.json"
+
+    @classmethod
+    def setUpClass(cls):
+        cls.sds = SystemDSContext()
+        cls.d = DataManager()
+
+    @classmethod
+    def tearDownClass(cls):
+        cls.sds.close()
+
+    def test_train_data(self):
+        x = self.d.get_train_data()
+        self.assertEqual((32561, 14), x.shape)
+
+    def test_train_labels(self):
+        y = self.d.get_train_labels()
+        self.assertEqual((32561,), y.shape)
+
+    def test_test_data(self):
+        x_l = self.d.get_test_data()
+        self.assertEqual((16281, 14), x_l.shape)
+
+    def test_test_labels(self):
+        y_l = self.d.get_test_labels()
+        self.assertEqual((16281,), y_l.shape)
+
+    def test_preprocess(self):
+        #assumes certain preprocessing
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset()
+        self.assertEqual((30162,104), train_data.shape)
+        self.assertEqual((30162, ), train_labels.shape)
+        self.assertEqual((15060,104), test_data.shape)
+        self.assertEqual((15060, ), test_labels.shape)
+
+    def test_multi_log_reg(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset()
+
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+        Y = Y + 1.0
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+        Yt = Yt + 1.0
+
+        betas = multiLogReg(X, Y)
+
+        [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt).compute()
+
+        self.assertGreater(acc, 80)
+
+        confusion_matrix_abs, _ = confusionMatrix(self.sds.from_numpy(y_pred), Yt).compute()
+
+        self.assertTrue(
+            np.allclose(
+                confusion_matrix_abs,
+                np.array([[3503, 503],
+                          [268, 726]])
+            )
+        )
+
+
+    def test_multi_log_reg_interpolated_standardized(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+        Y = Y + 1.0
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+        Yt = Yt + 1.0
+
+        betas = multiLogReg(X, Y)
+
+        [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt).compute()
+
+        self.assertGreater(acc, 80)
+        
+        confusion_matrix_abs, _ = confusionMatrix(self.sds.from_numpy(y_pred), Yt).compute()
+
+        self.assertTrue(
+            np.allclose(
+                confusion_matrix_abs,
+                np.array([[3583,  502],
+                         [245,  670]])
+            )
+        )
+
+
+    def test_neural_net(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+
+        FFN_package = self.sds.source(self.neural_net_src_path, "fnn", print_imported_methods=True)
+
+        network = FFN_package.train(X, Y, 1, 16, 0.01, 1)
+
+        self.assertTrue(type(network) is not None) # sourcing and training seems to works
+
+        FFN_package.save_model(network, '"model/python_FFN/"').compute(verbose=True)
+
+        # TODO This does not work yet, not sure what the problem is
+        #probs = FFN_package.predict(Xt, network).compute(True)
+        # FFN_package.eval(Yt, Yt).compute()

Review comment:
       I will make this work.

##########
File path: src/main/python/tests/examples/tutorials/test_adult.py
##########
@@ -0,0 +1,243 @@
+# -------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+# -------------------------------------------------------------
+import os
+import unittest
+
+import numpy as np
+from systemds.context import SystemDSContext
+from systemds.examples.tutorials.adult import DataManager
+from systemds.operator import OperationNode
+from systemds.operator.algorithm import kmeans, multiLogReg, multiLogRegPredict, l2svm, confusionMatrix, scale, scaleApply, split, winsorize
+from systemds.script_building import DMLScript
+
+
+class Test_DMLScript(unittest.TestCase):
+    """
+    Test class for adult dml script tutorial code.
+    """
+
+    sds: SystemDSContext = None
+    d: DataManager = None
+    neural_net_src_path: str = "tests/examples/tutorials/neural_net_source.dml"
+    dataset_path_train: str = "../../test/resources/datasets/adult/train_data.csv"
+    dataset_path_train_mtd: str = "../../test/resources/datasets/adult/train_data.csv.mtd"
+    dataset_path_test: str = "../../test/resources/datasets/adult/test_data.csv"
+    dataset_path_test_mtd: str = "../../test/resources/datasets/adult/test_data.csv.mtd"
+    dataset_jspec: str = "../../test/resources/datasets/adult/jspec.json"
+
+    @classmethod
+    def setUpClass(cls):
+        cls.sds = SystemDSContext()
+        cls.d = DataManager()
+
+    @classmethod
+    def tearDownClass(cls):
+        cls.sds.close()
+
+    def test_train_data(self):
+        x = self.d.get_train_data()
+        self.assertEqual((32561, 14), x.shape)
+
+    def test_train_labels(self):
+        y = self.d.get_train_labels()
+        self.assertEqual((32561,), y.shape)
+
+    def test_test_data(self):
+        x_l = self.d.get_test_data()
+        self.assertEqual((16281, 14), x_l.shape)
+
+    def test_test_labels(self):
+        y_l = self.d.get_test_labels()
+        self.assertEqual((16281,), y_l.shape)
+
+    def test_preprocess(self):
+        #assumes certain preprocessing
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset()
+        self.assertEqual((30162,104), train_data.shape)
+        self.assertEqual((30162, ), train_labels.shape)
+        self.assertEqual((15060,104), test_data.shape)
+        self.assertEqual((15060, ), test_labels.shape)
+
+    def test_multi_log_reg(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset()
+
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+        Y = Y + 1.0
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+        Yt = Yt + 1.0
+
+        betas = multiLogReg(X, Y)
+
+        [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt).compute()
+
+        self.assertGreater(acc, 80)
+
+        confusion_matrix_abs, _ = confusionMatrix(self.sds.from_numpy(y_pred), Yt).compute()
+
+        self.assertTrue(
+            np.allclose(
+                confusion_matrix_abs,
+                np.array([[3503, 503],
+                          [268, 726]])
+            )
+        )
+
+
+    def test_multi_log_reg_interpolated_standardized(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+        Y = Y + 1.0
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+        Yt = Yt + 1.0
+
+        betas = multiLogReg(X, Y)
+
+        [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt).compute()
+
+        self.assertGreater(acc, 80)
+        
+        confusion_matrix_abs, _ = confusionMatrix(self.sds.from_numpy(y_pred), Yt).compute()
+
+        self.assertTrue(
+            np.allclose(
+                confusion_matrix_abs,
+                np.array([[3583,  502],
+                         [245,  670]])
+            )
+        )
+
+
+    def test_neural_net(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+
+        FFN_package = self.sds.source(self.neural_net_src_path, "fnn", print_imported_methods=True)
+
+        network = FFN_package.train(X, Y, 1, 16, 0.01, 1)
+
+        self.assertTrue(type(network) is not None) # sourcing and training seems to works
+
+        FFN_package.save_model(network, '"model/python_FFN/"').compute(verbose=True)
+
+        # TODO This does not work yet, not sure what the problem is
+        #probs = FFN_package.predict(Xt, network).compute(True)
+        # FFN_package.eval(Yt, Yt).compute()
+
+
+    def test_parse_dataset_with_systemdsread(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 30000
+        test_count = 10000
+        #self.sds.read(self.dataset_path_train, schema=self.dataset_path_train_mtd).compute(verbose=True)
+        print("")
+
+        SCHEMA = '"DOUBLE,STRING,DOUBLE,STRING,DOUBLE,STRING,STRING,STRING,STRING,STRING,DOUBLE,DOUBLE,DOUBLE,STRING,STRING"'
+        F1 = self.sds.read(
+            self.dataset_path_train,
+            schema=SCHEMA
+        )
+        F2 = self.sds.read(
+            self.dataset_path_test,
+            schema=SCHEMA
+        )
+        jspec = self.sds.read(self.dataset_jspec, data_type="scalar", value_type="string")
+        # scaling does not have an effect yet; we need to replace the labels in the test set with the same strings as in the train dataset
+        X1, M1 = F1.rbind(F2).transform_encode(spec=jspec).compute()
+        col_length = len(X1[0])
+        X = X1[0:train_count, 0:col_length-1]
+        Y = X1[0:train_count, col_length-1:col_length].flatten()-1
+        # Test data
+        Xt = X1[train_count:train_count+test_count, 0:col_length-1]
+        Yt = X1[train_count:train_count+test_count, col_length-1:col_length].flatten()-1
+
+        _ , mean , sigma = scale(self.sds.from_numpy(X), True, True).compute()
+
+        mean_copy = np.array(mean)
+        sigma_copy = np.array(sigma)
+
+        numerical_cols = []
+        for count, col in enumerate(np.transpose(X)):
+            for entry in col:
+                if entry > 1 or entry < 0 or entry > 0 and entry < 1:
+                    numerical_cols.append(count)
+                    break
+
+        for x in range(0,105):
+            if not x in numerical_cols:
+                mean_copy[0][x] = 0
+                sigma_copy[0][x] = 1
+
+        mean = self.sds.from_numpy(mean_copy)
+        sigma = self.sds.from_numpy(sigma_copy)
+        X = self.sds.from_numpy(X)
+        Xt = self.sds.from_numpy(Xt)
+        X = scaleApply(winsorize(X, True), mean, sigma)
+        Xt = scaleApply(winsorize(Xt, True), mean, sigma)
+        #node = PROCESSING_split_package.m_split(X1,X1)
+        #X,Y = node.compute()
+
+        # Train data
+
+        FFN_package = self.sds.source(self.neural_net_src_path, "fnn", print_imported_methods=True)
+
+
+        network = FFN_package.train(X, self.sds.from_numpy(Y), 1, 16, 0.01, 1)
+
+        self.assertTrue(network is not None)  # sourcing and training seem to work
+
+        FFN_package.save_model(network, '"model/python_FFN/"').compute(verbose=True)
+
+        # TODO This does not work yet, not sure what the problem is
+        # FFN_package.eval(Yt, Yt).compute()"""

Review comment:
       I will make this work.

##########
File path: src/main/python/tests/examples/tutorials/test_adult.py
##########
@@ -0,0 +1,243 @@
+# -------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+# -------------------------------------------------------------
+import os
+import unittest
+
+import numpy as np
+from systemds.context import SystemDSContext
+from systemds.examples.tutorials.adult import DataManager
+from systemds.operator import OperationNode
+from systemds.operator.algorithm import kmeans, multiLogReg, multiLogRegPredict, l2svm, confusionMatrix, scale, scaleApply, split, winsorize
+from systemds.script_building import DMLScript
+
+
+class Test_DMLScript(unittest.TestCase):
+    """
+    Test class for adult dml script tutorial code.
+    """
+
+    sds: SystemDSContext = None
+    d: DataManager = None
+    neural_net_src_path: str = "tests/examples/tutorials/neural_net_source.dml"
+    dataset_path_train: str = "../../test/resources/datasets/adult/train_data.csv"
+    dataset_path_train_mtd: str = "../../test/resources/datasets/adult/train_data.csv.mtd"
+    dataset_path_test: str = "../../test/resources/datasets/adult/test_data.csv"
+    dataset_path_test_mtd: str = "../../test/resources/datasets/adult/test_data.csv.mtd"
+    dataset_jspec: str = "../../test/resources/datasets/adult/jspec.json"
+
+    @classmethod
+    def setUpClass(cls):
+        cls.sds = SystemDSContext()
+        cls.d = DataManager()
+
+    @classmethod
+    def tearDownClass(cls):
+        cls.sds.close()
+
+    def test_train_data(self):
+        x = self.d.get_train_data()
+        self.assertEqual((32561, 14), x.shape)
+
+    def test_train_labels(self):
+        y = self.d.get_train_labels()
+        self.assertEqual((32561,), y.shape)
+
+    def test_test_data(self):
+        x_l = self.d.get_test_data()
+        self.assertEqual((16281, 14), x_l.shape)
+
+    def test_test_labels(self):
+        y_l = self.d.get_test_labels()
+        self.assertEqual((16281,), y_l.shape)
+
+    def test_preprocess(self):
+        #assumes certain preprocessing
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset()
+        self.assertEqual((30162,104), train_data.shape)
+        self.assertEqual((30162, ), train_labels.shape)
+        self.assertEqual((15060,104), test_data.shape)
+        self.assertEqual((15060, ), test_labels.shape)
+
+    def test_multi_log_reg(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset()
+
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+        Y = Y + 1.0
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+        Yt = Yt + 1.0
+
+        betas = multiLogReg(X, Y)
+
+        [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt).compute()
+
+        self.assertGreater(acc, 80)
+
+        confusion_matrix_abs, _ = confusionMatrix(self.sds.from_numpy(y_pred), Yt).compute()
+
+        self.assertTrue(
+            np.allclose(
+                confusion_matrix_abs,
+                np.array([[3503, 503],
+                          [268, 726]])
+            )
+        )
+
+
+    def test_multi_log_reg_interpolated_standardized(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+        Y = Y + 1.0
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+        Yt = Yt + 1.0
+
+        betas = multiLogReg(X, Y)
+
+        [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt).compute()
+
+        self.assertGreater(acc, 80)
+        
+        confusion_matrix_abs, _ = confusionMatrix(self.sds.from_numpy(y_pred), Yt).compute()
+
+        self.assertTrue(
+            np.allclose(
+                confusion_matrix_abs,
+                np.array([[3583,  502],
+                         [245,  670]])
+            )
+        )
+
+
+    def test_neural_net(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+
+        FFN_package = self.sds.source(self.neural_net_src_path, "fnn", print_imported_methods=True)
+
+        network = FFN_package.train(X, Y, 1, 16, 0.01, 1)
+
+        self.assertTrue(network is not None)  # sourcing and training seem to work
+
+        FFN_package.save_model(network, '"model/python_FFN/"').compute(verbose=True)
+
+        # TODO This does not work yet, not sure what the problem is
+        #probs = FFN_package.predict(Xt, network).compute(True)
+        # FFN_package.eval(Yt, Yt).compute()
+
+
+    def test_parse_dataset_with_systemdsread(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 30000
+        test_count = 10000
+        #self.sds.read(self.dataset_path_train, schema=self.dataset_path_train_mtd).compute(verbose=True)
+        print("")
+
+        SCHEMA = '"DOUBLE,STRING,DOUBLE,STRING,DOUBLE,STRING,STRING,STRING,STRING,STRING,DOUBLE,DOUBLE,DOUBLE,STRING,STRING"'
+        F1 = self.sds.read(
+            self.dataset_path_train,
+            schema=SCHEMA
+        )
+        F2 = self.sds.read(
+            self.dataset_path_test,
+            schema=SCHEMA
+        )
+        jspec = self.sds.read(self.dataset_jspec, data_type="scalar", value_type="string")
+        # scaling does not have an effect yet; we need to replace the labels in the test set with the same strings as in the train dataset
+        X1, M1 = F1.rbind(F2).transform_encode(spec=jspec).compute()
+        col_length = len(X1[0])
+        X = X1[0:train_count, 0:col_length-1]
+        Y = X1[0:train_count, col_length-1:col_length].flatten()-1
+        # Test data
+        Xt = X1[train_count:train_count+test_count, 0:col_length-1]
+        Yt = X1[train_count:train_count+test_count, col_length-1:col_length].flatten()-1
+
+        _ , mean , sigma = scale(self.sds.from_numpy(X), True, True).compute()
+
+        mean_copy = np.array(mean)
+        sigma_copy = np.array(sigma)
+
+        numerical_cols = []
+        for count, col in enumerate(np.transpose(X)):
+            for entry in col:
+                if entry > 1 or entry < 0 or entry > 0 and entry < 1:
+                    numerical_cols.append(count)
+                    break
+
+        for x in range(0,105):
+            if not x in numerical_cols:
+                mean_copy[0][x] = 0
+                sigma_copy[0][x] = 1
+
+        mean = self.sds.from_numpy(mean_copy)
+        sigma = self.sds.from_numpy(sigma_copy)
+        X = self.sds.from_numpy(X)
+        Xt = self.sds.from_numpy(Xt)
+        X = scaleApply(winsorize(X, True), mean, sigma)

Review comment:
       Also, maybe apply winsorize before computing the scaling? (I'm not sure what the best practice is here.)
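
   A rough sketch of that ordering (not the PR's code; it assumes `X` and `Xt` are the SystemDS training and test matrices from above, and that the statistics should come from the winsorized training data):

   ```
   X_w = winsorize(X, True)                    # clip outliers on the training features first
   _, mean, sigma = scale(X_w, True, True)     # then derive mean/sigma from the clipped data
   Xt_s = scaleApply(winsorize(Xt, True), mean, sigma)  # reuse the same statistics on the test side
   ```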

##########
File path: src/main/python/tests/examples/tutorials/test_adult.py
##########
@@ -0,0 +1,243 @@
+# -------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+# -------------------------------------------------------------
+import os
+import unittest
+
+import numpy as np
+from systemds.context import SystemDSContext
+from systemds.examples.tutorials.adult import DataManager
+from systemds.operator import OperationNode
+from systemds.operator.algorithm import kmeans, multiLogReg, multiLogRegPredict, l2svm, confusionMatrix, scale, scaleApply, split, winsorize
+from systemds.script_building import DMLScript
+
+
+class Test_DMLScript(unittest.TestCase):
+    """
+    Test class for adult dml script tutorial code.
+    """
+
+    sds: SystemDSContext = None
+    d: DataManager = None
+    neural_net_src_path: str = "tests/examples/tutorials/neural_net_source.dml"
+    dataset_path_train: str = "../../test/resources/datasets/adult/train_data.csv"
+    dataset_path_train_mtd: str = "../../test/resources/datasets/adult/train_data.csv.mtd"
+    dataset_path_test: str = "../../test/resources/datasets/adult/test_data.csv"
+    dataset_path_test_mtd: str = "../../test/resources/datasets/adult/test_data.csv.mtd"
+    dataset_jspec: str = "../../test/resources/datasets/adult/jspec.json"
+
+    @classmethod
+    def setUpClass(cls):
+        cls.sds = SystemDSContext()
+        cls.d = DataManager()
+
+    @classmethod
+    def tearDownClass(cls):
+        cls.sds.close()
+
+    def test_train_data(self):
+        x = self.d.get_train_data()
+        self.assertEqual((32561, 14), x.shape)
+
+    def test_train_labels(self):
+        y = self.d.get_train_labels()
+        self.assertEqual((32561,), y.shape)
+
+    def test_test_data(self):
+        x_l = self.d.get_test_data()
+        self.assertEqual((16281, 14), x_l.shape)
+
+    def test_test_labels(self):
+        y_l = self.d.get_test_labels()
+        self.assertEqual((16281,), y_l.shape)
+
+    def test_preprocess(self):
+        #assumes certain preprocessing
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset()
+        self.assertEqual((30162,104), train_data.shape)
+        self.assertEqual((30162, ), train_labels.shape)
+        self.assertEqual((15060,104), test_data.shape)
+        self.assertEqual((15060, ), test_labels.shape)
+
+    def test_multi_log_reg(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset()
+
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+        Y = Y + 1.0
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+        Yt = Yt + 1.0
+
+        betas = multiLogReg(X, Y)
+
+        [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt).compute()
+
+        self.assertGreater(acc, 80)
+
+        confusion_matrix_abs, _ = confusionMatrix(self.sds.from_numpy(y_pred), Yt).compute()
+
+        self.assertTrue(
+            np.allclose(
+                confusion_matrix_abs,
+                np.array([[3503, 503],
+                          [268, 726]])
+            )
+        )
+
+
+    def test_multi_log_reg_interpolated_standardized(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+        Y = Y + 1.0
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+        Yt = Yt + 1.0
+
+        betas = multiLogReg(X, Y)
+
+        [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt).compute()
+
+        self.assertGreater(acc, 80)
+        
+        confusion_matrix_abs, _ = confusionMatrix(self.sds.from_numpy(y_pred), Yt).compute()
+
+        self.assertTrue(
+            np.allclose(
+                confusion_matrix_abs,
+                np.array([[3583,  502],
+                         [245,  670]])
+            )
+        )
+
+
+    def test_neural_net(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+
+        FFN_package = self.sds.source(self.neural_net_src_path, "fnn", print_imported_methods=True)
+
+        network = FFN_package.train(X, Y, 1, 16, 0.01, 1)
+
+        self.assertTrue(network is not None)  # sourcing and training seem to work
+
+        FFN_package.save_model(network, '"model/python_FFN/"').compute(verbose=True)
+
+        # TODO This does not work yet, not sure what the problem is
+        #probs = FFN_package.predict(Xt, network).compute(True)
+        # FFN_package.eval(Yt, Yt).compute()
+
+
+    def test_parse_dataset_with_systemdsread(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 30000
+        test_count = 10000
+        #self.sds.read(self.dataset_path_train, schema=self.dataset_path_train_mtd).compute(verbose=True)
+        print("")
+
+        SCHEMA = '"DOUBLE,STRING,DOUBLE,STRING,DOUBLE,STRING,STRING,STRING,STRING,STRING,DOUBLE,DOUBLE,DOUBLE,STRING,STRING"'
+        F1 = self.sds.read(
+            self.dataset_path_train,
+            schema=SCHEMA
+        )
+        F2 = self.sds.read(
+            self.dataset_path_test,
+            schema=SCHEMA
+        )
+        jspec = self.sds.read(self.dataset_jspec, data_type="scalar", value_type="string")
+        # scaling does not have an effect yet; we need to replace the labels in the test set with the same strings as in the train dataset
+        X1, M1 = F1.rbind(F2).transform_encode(spec=jspec).compute()
+        col_length = len(X1[0])
+        X = X1[0:train_count, 0:col_length-1]
+        Y = X1[0:train_count, col_length-1:col_length].flatten()-1
+        # Test data
+        Xt = X1[train_count:train_count+test_count, 0:col_length-1]
+        Yt = X1[train_count:train_count+test_count, col_length-1:col_length].flatten()-1
+
+        _ , mean , sigma = scale(self.sds.from_numpy(X), True, True).compute()
+
+        mean_copy = np.array(mean)
+        sigma_copy = np.array(sigma)
+
+        numerical_cols = []
+        for count, col in enumerate(np.transpose(X)):
+            for entry in col:
+                if entry > 1 or entry < 0 or entry > 0 and entry < 1:
+                    numerical_cols.append(count)
+                    break
+
+        for x in range(0,105):
+            if not x in numerical_cols:
+                mean_copy[0][x] = 0
+                sigma_copy[0][x] = 1
+
+        mean = self.sds.from_numpy(mean_copy)
+        sigma = self.sds.from_numpy(sigma_copy)
+        X = self.sds.from_numpy(X)
+        Xt = self.sds.from_numpy(Xt)
+        X = scaleApply(winsorize(X, True), mean, sigma)

Review comment:
       The second scaling is not needed if you use the output from the first.
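
   Concretely, a sketch of what that could look like for the two lines under review (not the PR's code; it also folds in the winsorize-first suggestion and ignores the per-column mean/sigma adjustment above):

   ```
   X, mean, sigma = scale(winsorize(X, True), True, True)   # training matrix comes straight from scale()
   Xt = scaleApply(winsorize(Xt, True), mean, sigma)         # test matrix only re-applies the statistics
   ```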




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] Baunsgaard commented on pull request #1323: [WIP][SYSTEMDS-2835] Python end-to-end tutorial

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on pull request #1323:
URL: https://github.com/apache/systemds/pull/1323#issuecomment-869197903


   > @Baunsgaard - Is it possible to keep datasets in the website instead of committing to this repo.
   
   Yes. Will do when we get to merging. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] Gandagorn commented on a change in pull request #1323: [WIP][SYSTEMDS-2835] Python end-to-end tutorial

Posted by GitBox <gi...@apache.org>.
Gandagorn commented on a change in pull request #1323:
URL: https://github.com/apache/systemds/pull/1323#discussion_r668004905



##########
File path: src/main/python/tests/examples/tutorials/test_adult.py
##########
@@ -387,6 +387,11 @@ def test_level2(self):
         ################################################################################################################
         X1, M1 = X1.transform_encode(spec=jspec)
 
+        # better alternative for encoding
+        # X1, M = F1.transform_encode(spec=jspec)
+        # X2 = F2.transform_apply(spec=jspec, meta=M)
+        # testX2 = X2.compute(True)

Review comment:
       1. There are only two different labels to predict: whether a person makes "<=50K" a year or ">50K" a year. The two labels in the training set are correctly denoted as either "<=50K" or ">50K"; however, in the test set they are called "<=50K." and ">50K.". The dot at the end of the test-set labels prevents us from using the transform_apply function correctly.
   2. The DML function we want to use should be able to replace a string in the target column of the frame, similar to:
   
   ```
   replace_target_frame = function(String replacement, String to_replace, Frame[Unknown] X)
     return(Frame[Unknown] X)
   {
     for (i in 1:nrow(X)) {
       if (as.scalar(X[i, ncol(X)]) == to_replace) {
         X[i, ncol(X)] = replacement;
       }
     }
   }
   ```
   
   However, when trying to load the function into Python, we get the error
   > "NotImplementedError: Not Implemented type parsing for function def: Frame[Unknown]X"
   
   
   
   
   

##########
File path: src/main/python/tests/examples/tutorials/test_adult.py
##########
@@ -387,6 +387,11 @@ def test_level2(self):
         ################################################################################################################
         X1, M1 = X1.transform_encode(spec=jspec)
 
+        # better alternative for encoding
+        # X1, M = F1.transform_encode(spec=jspec)
+        # X2 = F2.transform_apply(spec=jspec, meta=M)
+        # testX2 = X2.compute(True)

Review comment:
       Thank you very much! Still, I am confused about how to use the replace function from the Python environment: we load the data into a frame in Python, and when we try to use `sds.source` with a DML file that includes a function taking a frame as an argument, it throws the above error.
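
   For reference, a sketch of the intended frame-level fix (not verified here; it assumes the frame `replace` method and `transform_apply` behave as used elsewhere in this PR, where the same lines appear commented out):

   ```
   # strip the trailing dot from the test-set labels before encoding
   F2 = F2.replace("<=50K.", "<=50K")
   F2 = F2.replace(">50K.", ">50K")

   # encode the training frame, then re-apply the same metadata to the test frame
   X1, M = F1.transform_encode(spec=jspec)
   X2 = F2.transform_apply(spec=jspec, meta=M)
   ```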




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] Baunsgaard commented on pull request #1323: [WIP][SYSTEMDS-2835] Python end-to-end tutorial

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on pull request #1323:
URL: https://github.com/apache/systemds/pull/1323#issuecomment-878149112


   > Hi @Baunsgaard,
   > we added the lvl1 and lvl2 rst code. Should we split the different levels into different files or should they all be in one file? If they are all in the same file should we only comment on the new code bits in each progressive level or should every level be self-explanatory with the comments?
   > BR
   
   Your choice, do what makes the most sense to you.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] codeyeeter commented on a change in pull request #1323: [WIP][SYSTEMDS-2835] Python end-to-end tutorial

Posted by GitBox <gi...@apache.org>.
codeyeeter commented on a change in pull request #1323:
URL: https://github.com/apache/systemds/pull/1323#discussion_r663759655



##########
File path: src/main/python/tests/examples/tutorials/test_adult.py
##########
@@ -386,51 +386,35 @@ def test_level2(self):
 
         """""
         ################################################################################################################
-        X1, M1 = X1.transform_encode(spec=jspec).compute()
+        X1, M1 = X1.transform_encode(spec=jspec)
 
         ################################################################################################################
         """"
-        First we re-split out data into a training and a test set with the corresponding labels. We can then simply transform
-        the numpy array of the training data back to SystemDS matrix by using "sds.from_numpy()". 
-        The SystemDS scale function takes a matrix as an input and returns three output parameters:
-            # Y            Matrix    ---      Output feature matrix with K columns
-            # ColMean      Matrix    ---      The column means of the input, subtracted if Center was TRUE
-            # ScaleFactor  Matrix    ---      The Scaling of the values, to make each dimension have similar value ranges
-        If we want to retransform a SystemDs Matrix to a Numpy array we can do so by using the np.array() function. 
+        First we re-split our data into a training and a test set with the corresponding labels.
         """""
         ################################################################################################################
-        col_length = len(X1[0])
-        X = X1[0:train_count, 0:col_length - 1]
-        Y = X1[0:train_count, col_length - 1:col_length].flatten()
-        # Test data
-        Xt = X1[train_count:train_count + test_count, 0:col_length - 1]
-        Yt = X1[train_count:train_count + test_count, col_length - 1:col_length].flatten()
+        PREPROCESS_package = self.sds.source(self.preprocess_src_path, "preprocess", print_imported_methods=True)
 
+        X = PREPROCESS_package.get_X(X1, train_count)
+        Y = PREPROCESS_package.get_Y(X1, train_count)
+        # We lose the column count information after using the preprocess package. This triggers an error in multiLogRegPredict; otherwise it's working.
+        Xt = self.sds.from_numpy(np.array(PREPROCESS_package.get_Xt(X1, train_count).compute()))

Review comment:
       @Baunsgaard 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] martinhofwe commented on a change in pull request #1323: [WIP][SYSTEMDS-2835] Python end-to-end tutorial

Posted by GitBox <gi...@apache.org>.
martinhofwe commented on a change in pull request #1323:
URL: https://github.com/apache/systemds/pull/1323#discussion_r679137870



##########
File path: src/main/python/tests/examples/tutorials/test_adult.py
##########
@@ -0,0 +1,324 @@
+# -------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+# -------------------------------------------------------------
+import os
+import unittest
+
+import numpy as np
+from systemds.context import SystemDSContext
+from systemds.examples.tutorials.adult import DataManager
+from systemds.operator import OperationNode, Matrix, Frame
+from systemds.operator.algorithm import kmeans, multiLogReg, multiLogRegPredict, l2svm, confusionMatrix, scale, scaleApply, split, winsorize
+from systemds.script_building import DMLScript
+
+
+class Test_DMLScript(unittest.TestCase):
+    """
+    Test class for adult dml script tutorial code.
+    """
+
+    sds: SystemDSContext = None
+    d: DataManager = None
+    neural_net_src_path: str = "tests/examples/tutorials/neural_net_source.dml"
+    preprocess_src_path: str = "tests/examples/tutorials/preprocess.dml"
+    dataset_path_train: str = "../../test/resources/datasets/adult/train_data.csv"
+    dataset_path_train_mtd: str = "../../test/resources/datasets/adult/train_data.csv.mtd"
+    dataset_path_test: str = "../../test/resources/datasets/adult/test_data.csv"
+    dataset_path_test_mtd: str = "../../test/resources/datasets/adult/test_data.csv.mtd"
+    dataset_jspec: str = "../../test/resources/datasets/adult/jspec.json"
+
+    @classmethod
+    def setUpClass(cls):
+        cls.sds = SystemDSContext()
+        cls.d = DataManager()
+
+    @classmethod
+    def tearDownClass(cls):
+        cls.sds.close()
+
+    def test_train_data(self):
+        x = self.d.get_train_data()
+        self.assertEqual((32561, 14), x.shape)
+
+    def test_train_labels(self):
+        y = self.d.get_train_labels()
+        self.assertEqual((32561,), y.shape)
+
+    def test_test_data(self):
+        x_l = self.d.get_test_data()
+        self.assertEqual((16281, 14), x_l.shape)
+
+    def test_test_labels(self):
+        y_l = self.d.get_test_labels()
+        self.assertEqual((16281,), y_l.shape)
+
+    def test_preprocess(self):
+        #assumes certain preprocessing
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset()
+        self.assertEqual((30162,104), train_data.shape)
+        self.assertEqual((30162, ), train_labels.shape)
+        self.assertEqual((15060,104), test_data.shape)
+        self.assertEqual((15060, ), test_labels.shape)
+
+    def test_multi_log_reg(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset()
+
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+        Y = Y + 1.0
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+        Yt = Yt + 1.0
+
+        betas = multiLogReg(X, Y)
+
+        [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt).compute()
+
+        self.assertGreater(acc, 80)
+
+        confusion_matrix_abs, _ = confusionMatrix(self.sds.from_numpy(y_pred), Yt).compute()
+
+        self.assertTrue(
+            np.allclose(
+                confusion_matrix_abs,
+                np.array([[3503, 503],
+                          [268, 726]])
+            )
+        )
+
+    def test_neural_net(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+
+        FFN_package = self.sds.source(self.neural_net_src_path, "fnn", print_imported_methods=True)
+
+        network = FFN_package.train(X, Y, 1, 16, 0.01, 1)
+
+        self.assertTrue(network is not None)  # sourcing and training seem to work
+
+        FFN_package.save_model(network, '"model/python_FFN/"').compute(verbose=True)
+
+        # TODO This does not work yet, not sure what the problem is
+        #probs = FFN_package.predict(Xt, network).compute(True)
+        # FFN_package.eval(Yt, Yt).compute()
+
+
+
+    def test_level1(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset(interpolate=True,
+                                                                                           standardize=True, dimred=0.1)
+        # Train data
+        X = self.sds.from_numpy(train_data[:train_count])
+        Y = self.sds.from_numpy(train_labels[:train_count])
+        Y = Y + 1.0
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+        Yt = Yt + 1.0
+
+        betas = multiLogReg(X, Y)
+
+        [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt).compute()
+        self.assertGreater(acc, 80) #Todo remove?
+        # todo add text how high acc should be with this config
+
+        confusion_matrix_abs, _ = confusionMatrix(self.sds.from_numpy(y_pred), Yt).compute()
+        # todo print confusion matrix? Explain cm?
+        self.assertTrue(
+            np.allclose(
+                confusion_matrix_abs,
+                np.array([[3583, 502],
+                          [245, 670]])
+            )
+        )
+
+    def test_level2(self):
+
+        train_count = 32561
+        test_count = 16281
+
+        SCHEMA = '"DOUBLE,STRING,DOUBLE,STRING,DOUBLE,STRING,STRING,STRING,STRING,STRING,DOUBLE,DOUBLE,DOUBLE,STRING,STRING"'
+
+        F1 = self.sds.read(
+            self.dataset_path_train,
+            schema=SCHEMA
+        )
+        F2 = self.sds.read(
+            self.dataset_path_test,
+            schema=SCHEMA
+        )
+
+        jspec = self.sds.read(self.dataset_jspec, data_type="scalar", value_type="string")
+        PREPROCESS_package = self.sds.source(self.preprocess_src_path, "preprocess", print_imported_methods=True)
+
+        X1 = F1.rbind(F2)
+        X1, M1 = X1.transform_encode(spec=jspec)
+
+        X = PREPROCESS_package.get_X(X1, 1, train_count)
+        Y = PREPROCESS_package.get_Y(X1, 1, train_count)
+
+        Xt = PREPROCESS_package.get_X(X1, train_count, train_count+test_count)
+        Yt = PREPROCESS_package.get_Y(X1, train_count, train_count+test_count)
+
+        Yt = PREPROCESS_package.replace_value(Yt, 3.0, 1.0)
+        Yt = PREPROCESS_package.replace_value(Yt, 4.0, 2.0)
+
+        # better alternative for encoding. This was intended, but it does not work
+        #F2 = F2.replace("<=50K.", "<=50K")
+        #F2 = F2.replace(">50K.", ">50K")
+        #X1, M = F1.transform_encode(spec=jspec)
+        #X2 = F2.transform_apply(spec=jspec, meta=M)
+
+        #X = PREPROCESS_package.get_X(X1, 1, train_count)
+        #Y = PREPROCESS_package.get_Y(X1, 1, train_count)
+        #Xt = PREPROCESS_package.get_X(X2, 1, test_count)
+        #Yt = PREPROCESS_package.get_Y(X2, 1, test_count)
+
+        # TODO somehow throws error at predict with this included
+        #X, mean, sigma = scale(X, True, True)
+        #Xt = scaleApply(Xt, mean, sigma)
+
+        betas = multiLogReg(X, Y)
+
+        [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt)
+
+        confusion_matrix_abs, _ = confusionMatrix(y_pred, Yt).compute()
+        print(confusion_matrix_abs)
+        self.assertTrue(
+            np.allclose(
+                confusion_matrix_abs,
+                np.array([[11593.,  1545.],
+                          [842., 2302.]])
+            )
+        )
+
+    def test_level3(self):
+        train_count = 32561
+        test_count = 16281
+
+        SCHEMA = '"DOUBLE,STRING,DOUBLE,STRING,DOUBLE,STRING,STRING,STRING,STRING,STRING,DOUBLE,DOUBLE,DOUBLE,STRING,STRING"'
+
+        F1 = self.sds.read(
+            self.dataset_path_train,
+            schema=SCHEMA
+        )
+        F2 = self.sds.read(
+            self.dataset_path_test,
+            schema=SCHEMA
+        )
+
+        jspec = self.sds.read(self.dataset_jspec, data_type="scalar", value_type="string")
+        PREPROCESS_package = self.sds.source(self.preprocess_src_path, "preprocess", print_imported_methods=True)
+
+        X1 = F1.rbind(F2)
+        X1, M1 = X1.transform_encode(spec=jspec)
+
+        X = PREPROCESS_package.get_X(X1, 1, train_count)
+        Y = PREPROCESS_package.get_Y(X1, 1, train_count)
+
+        Xt = PREPROCESS_package.get_X(X1, train_count, train_count + test_count)
+        Yt = PREPROCESS_package.get_Y(X1, train_count, train_count + test_count)
+
+        Yt = PREPROCESS_package.replace_value(Yt, 3.0, 1.0)
+        Yt = PREPROCESS_package.replace_value(Yt, 4.0, 2.0)
+
+        # better alternative for encoding
+        # F2 = F2.replace("<=50K.", "<=50K")
+        # F2 = F2.replace(">50K.", ">50K")
+        # X1, M = F1.transform_encode(spec=jspec)
+        # X2 = F2.transform_apply(spec=jspec, meta=M)
+
+        # X = PREPROCESS_package.get_X(X1, 1, train_count)
+        # Y = PREPROCESS_package.get_Y(X1, 1, train_count)
+        # Xt = PREPROCESS_package.get_X(X2, 1, test_count)
+        # Yt = PREPROCESS_package.get_Y(X2, 1, test_count)
+
+        # TODO somehow throws error at predict with this included
+        # X, mean, sigma = scale(X, True, True)
+        # Xt = scaleApply(Xt, mean, sigma)
+
+        FFN_package = self.sds.source(self.neural_net_src_path, "fnn", print_imported_methods=True)
+
+        epochs = 1
+        batch_size = 16
+        learning_rate = 0.01
+        seed = 42
+
+        network = FFN_package.train(X, Y, epochs, batch_size, learning_rate, seed)
+
+        """
+        If more resources are available, one can also choose to train the model using a parameter server.
+        Here we use the same parameters as before; however, we need to specify a few more.
+        """
+        ################################################################################################################
+        # workers = 1
+        # utype = '"BSP"'
+        # freq = '"EPOCH"'
+        # mode = '"LOCAL"'
+        # network = FFN_package.train_paramserv(X, Y, epochs,
+        #                                       batch_size, learning_rate, workers, utype, freq, mode,
+        #                                       seed)
+        ################################################################################################################
+
+        FFN_package.save_model(network, '"model/python_FFN/"').compute(verbose=True)
+
+        """
+        Next we evaluate our network on the test set which was not used for training.
+        The predict function with the test features and our trained network returns a matrix of class probabilities.
+        This matrix contains for each test sample the probabilities for each class.
+        For predicting the most likely class of a sample, we choose the class with the highest probability.
+        """
+        ################################################################################################################
+        #probs = FFN_package.predict(Xt, network)

Review comment:
       Yes, correct (predict seems to be the problem).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] martinhofwe commented on pull request #1323: [WIP][SYSTEMDS-2835] Python end-to-end tutorial

Posted by GitBox <gi...@apache.org>.
martinhofwe commented on pull request #1323:
URL: https://github.com/apache/systemds/pull/1323#issuecomment-875812011


   Hi, 
   we added the lvl1 and lvl2 rst code. Should we split the different levels into different files, or should they all be in one file? If they are all in the same file, should we only comment on the new code bits in each progressive level, or should every level be self-explanatory with the comments?
   BR


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] codeyeeter commented on a change in pull request #1323: [WIP][SYSTEMDS-2835] Python end-to-end tutorial

Posted by GitBox <gi...@apache.org>.
codeyeeter commented on a change in pull request #1323:
URL: https://github.com/apache/systemds/pull/1323#discussion_r663380211



##########
File path: src/main/python/tests/examples/tutorials/test_adult.py
##########
@@ -386,51 +386,35 @@ def test_level2(self):
 
         """""
         ################################################################################################################
-        X1, M1 = X1.transform_encode(spec=jspec).compute()
+        X1, M1 = X1.transform_encode(spec=jspec)
 
         ################################################################################################################
         """"
-        First we re-split out data into a training and a test set with the corresponding labels. We can then simply transform
-        the numpy array of the training data back to SystemDS matrix by using "sds.from_numpy()". 
-        The SystemDS scale function takes a matrix as an input and returns three output parameters:
-            # Y            Matrix    ---      Output feature matrix with K columns
-            # ColMean      Matrix    ---      The column means of the input, subtracted if Center was TRUE
-            # ScaleFactor  Matrix    ---      The Scaling of the values, to make each dimension have similar value ranges
-        If we want to retransform a SystemDs Matrix to a Numpy array we can do so by using the np.array() function. 
+        First we re-split our data into a training and a test set with the corresponding labels.
         """""
         ################################################################################################################
-        col_length = len(X1[0])
-        X = X1[0:train_count, 0:col_length - 1]
-        Y = X1[0:train_count, col_length - 1:col_length].flatten()
-        # Test data
-        Xt = X1[train_count:train_count + test_count, 0:col_length - 1]
-        Yt = X1[train_count:train_count + test_count, col_length - 1:col_length].flatten()
+        PREPROCESS_package = self.sds.source(self.preprocess_src_path, "preprocess", print_imported_methods=True)
 
+        X = PREPROCESS_package.get_X(X1, train_count)
+        Y = PREPROCESS_package.get_Y(X1, train_count)
+        # We lose the column count information after using the preprocess package. This triggers an error in multiLogRegPredict; otherwise it's working.
+        Xt = self.sds.from_numpy(np.array(PREPROCESS_package.get_Xt(X1, train_count).compute()))

Review comment:
       We lose the column count information after splitting the matrix in a sourced dml file. Is there a way around this issue without relying on this pretty bad workaround?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] codeyeeter commented on a change in pull request #1323: [WIP][SYSTEMDS-2835] Python end-to-end tutorial

Posted by GitBox <gi...@apache.org>.
codeyeeter commented on a change in pull request #1323:
URL: https://github.com/apache/systemds/pull/1323#discussion_r679151542



##########
File path: src/main/python/tests/manual_tests/multi_log_reg_adult.py
##########
@@ -0,0 +1,27 @@
+# -------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+# -------------------------------------------------------------
+
+from systemds.context import SystemDSContext
+from systemds.operator.algorithm import multiLogReg, multiLogRegPredict
+from systemds.examples.tutorials.adult import DataManager
+
+d = DataManager()
+

Review comment:
       No, we initially wanted to add some testing functionality in that file. Seems like we missed deleting the file.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] Baunsgaard closed pull request #1323: [WIP][SYSTEMDS-2835] Python end-to-end tutorial

Posted by GitBox <gi...@apache.org>.
Baunsgaard closed pull request #1323:
URL: https://github.com/apache/systemds/pull/1323


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] Baunsgaard commented on pull request #1323: [WIP][SYSTEMDS-2835] Python end-to-end tutorial

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on pull request #1323:
URL: https://github.com/apache/systemds/pull/1323#issuecomment-919788569


   Thanks for the PR; I have now merged it.
   
   While merging, I moved the data to our webpage and changed some of the tests.
   Thanks for the contribution.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] Gandagorn commented on a change in pull request #1323: [WIP][SYSTEMDS-2835] Python end-to-end tutorial

Posted by GitBox <gi...@apache.org>.
Gandagorn commented on a change in pull request #1323:
URL: https://github.com/apache/systemds/pull/1323#discussion_r663368991



##########
File path: src/main/python/tests/examples/tutorials/test_adult.py
##########
@@ -0,0 +1,243 @@
+# -------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+# -------------------------------------------------------------
+import os
+import unittest
+
+import numpy as np
+from systemds.context import SystemDSContext
+from systemds.examples.tutorials.adult import DataManager
+from systemds.operator import OperationNode
+from systemds.operator.algorithm import kmeans, multiLogReg, multiLogRegPredict, l2svm, confusionMatrix, scale, scaleApply, split, winsorize
+from systemds.script_building import DMLScript
+
+
+class Test_DMLScript(unittest.TestCase):
+    """
+    Test class for adult dml script tutorial code.
+    """
+
+    sds: SystemDSContext = None
+    d: DataManager = None
+    neural_net_src_path: str = "tests/examples/tutorials/neural_net_source.dml"
+    dataset_path_train: str = "../../test/resources/datasets/adult/train_data.csv"
+    dataset_path_train_mtd: str = "../../test/resources/datasets/adult/train_data.csv.mtd"
+    dataset_path_test: str = "../../test/resources/datasets/adult/test_data.csv"
+    dataset_path_test_mtd: str = "../../test/resources/datasets/adult/test_data.csv.mtd"
+    dataset_jspec: str = "../../test/resources/datasets/adult/jspec.json"
+
+    @classmethod
+    def setUpClass(cls):
+        cls.sds = SystemDSContext()
+        cls.d = DataManager()
+
+    @classmethod
+    def tearDownClass(cls):
+        cls.sds.close()
+
+    def test_train_data(self):
+        x = self.d.get_train_data()
+        self.assertEqual((32561, 14), x.shape)
+
+    def test_train_labels(self):
+        y = self.d.get_train_labels()
+        self.assertEqual((32561,), y.shape)
+
+    def test_test_data(self):
+        x_l = self.d.get_test_data()
+        self.assertEqual((16281, 14), x_l.shape)
+
+    def test_test_labels(self):
+        y_l = self.d.get_test_labels()
+        self.assertEqual((16281,), y_l.shape)
+
+    def test_preprocess(self):
+        #assumes certain preprocessing
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset()
+        self.assertEqual((30162,104), train_data.shape)
+        self.assertEqual((30162, ), train_labels.shape)
+        self.assertEqual((15060,104), test_data.shape)
+        self.assertEqual((15060, ), test_labels.shape)
+
+    def test_multi_log_reg(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset()
+
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+        Y = Y + 1.0
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+        Yt = Yt + 1.0
+
+        betas = multiLogReg(X, Y)
+
+        [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt).compute()
+
+        self.assertGreater(acc, 80)
+
+        confusion_matrix_abs, _ = confusionMatrix(self.sds.from_numpy(y_pred), Yt).compute()
+
+        self.assertTrue(
+            np.allclose(
+                confusion_matrix_abs,
+                np.array([[3503, 503],
+                          [268, 726]])
+            )
+        )
+
+
+    def test_multi_log_reg_interpolated_standardized(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+        Y = Y + 1.0
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+        Yt = Yt + 1.0
+
+        betas = multiLogReg(X, Y)
+
+        [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt).compute()
+
+        self.assertGreater(acc, 80)
+        
+        confusion_matrix_abs, _ = confusionMatrix(self.sds.from_numpy(y_pred), Yt).compute()
+
+        self.assertTrue(
+            np.allclose(
+                confusion_matrix_abs,
+                np.array([[3583,  502],
+                         [245,  670]])
+            )
+        )
+
+
+    def test_neural_net(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+
+        FFN_package = self.sds.source(self.neural_net_src_path, "fnn", print_imported_methods=True)
+
+        network = FFN_package.train(X, Y, 1, 16, 0.01, 1)
+
+        self.assertTrue(network is not None)  # sourcing and training seem to work
+
+        FFN_package.save_model(network, '"model/python_FFN/"').compute(verbose=True)
+
+        # TODO This does not work yet, not sure what the problem is
+        #probs = FFN_package.predict(Xt, network).compute(True)
+        # FFN_package.eval(Yt, Yt).compute()

Review comment:
       @Baunsgaard Hi, I was just wondering whether there has been any progress on this yet?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] codeyeeter commented on a change in pull request #1323: [WIP][SYSTEMDS-2835] Python end-to-end tutorial

Posted by GitBox <gi...@apache.org>.
codeyeeter commented on a change in pull request #1323:
URL: https://github.com/apache/systemds/pull/1323#discussion_r663759655



##########
File path: src/main/python/tests/examples/tutorials/test_adult.py
##########
@@ -386,51 +386,35 @@ def test_level2(self):
 
         """""
         ################################################################################################################
-        X1, M1 = X1.transform_encode(spec=jspec).compute()
+        X1, M1 = X1.transform_encode(spec=jspec)
 
         ################################################################################################################
         """"
-        First we re-split out data into a training and a test set with the corresponding labels. We can then simply transform
-        the numpy array of the training data back to SystemDS matrix by using "sds.from_numpy()". 
-        The SystemDS scale function takes a matrix as an input and returns three output parameters:
-            # Y            Matrix    ---      Output feature matrix with K columns
-            # ColMean      Matrix    ---      The column means of the input, subtracted if Center was TRUE
-            # ScaleFactor  Matrix    ---      The Scaling of the values, to make each dimension have similar value ranges
-        If we want to retransform a SystemDs Matrix to a Numpy array we can do so by using the np.array() function. 
+        First we re-split our data into a training and a test set with the corresponding labels.
         """""
         ################################################################################################################
-        col_length = len(X1[0])
-        X = X1[0:train_count, 0:col_length - 1]
-        Y = X1[0:train_count, col_length - 1:col_length].flatten()
-        # Test data
-        Xt = X1[train_count:train_count + test_count, 0:col_length - 1]
-        Yt = X1[train_count:train_count + test_count, col_length - 1:col_length].flatten()
+        PREPROCESS_package = self.sds.source(self.preprocess_src_path, "preprocess", print_imported_methods=True)
 
+        X = PREPROCESS_package.get_X(X1, train_count)
+        Y = PREPROCESS_package.get_Y(X1, train_count)
+        # We lose the column count information after using the PREPROCESS package. This triggers an error in multiLogRegPredict; otherwise it works.
+        Xt = self.sds.from_numpy(np.array(PREPROCESS_package.get_Xt(X1, train_count).compute()))

Review comment:
       @Baunsgaard 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] Baunsgaard commented on a change in pull request #1323: [WIP][SYSTEMDS-2835] Python end-to-end tutorial

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on a change in pull request #1323:
URL: https://github.com/apache/systemds/pull/1323#discussion_r678175510



##########
File path: src/main/python/docs/source/guide/python_end_to_end_tut.rst
##########
@@ -0,0 +1,561 @@
+.. -------------------------------------------------------------
+.. 
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+.. 
+..   http://www.apache.org/licenses/LICENSE-2.0
+.. 
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+.. 
+.. ------------------------------------------------------------
+
+Python end-to-end tutorial
+==========================
+
+The goal of this tutorial is to showcase different features of the SystemDS framework that can be accessed with the Python API.
+For this, we want to use the `Adult <https://archive.ics.uci.edu/ml/datasets/adult/>`_ dataset and predict whether the income of a person exceeds $50K/yr based on census data.
+The Adult dataset contains attributes like age, workclass, education, marital-status, occupation, race, [...] and the labels >50K or <=50K.
+Most of these features are categorical string values, but the dataset also includes continuous features.
+For this, we define three different levels with an increasing level of detail with regard to features provided by SystemDS.
+In the first level, we simply get an already preprocessed dataset from the built-in DataManager.
+The second level shows the built-in preprocessing capabilities of SystemDS.
+With the third level, we want to show how we can integrate custom-built networks or algorithms into our Python program.
+
+Prerequisite: 
+
+- :doc:`/getting_started/install`
+
+Level 1
+-------
+
+This example shows how one can work with NumPy data within the SystemDS framework. More precisely, we will make use of the
+built-in DataManager, Multinomial Logistic Regression function, and the Confusion Matrix function. The dataset used in this
+tutorial is a preprocessed version of the "UCI Adult Data Set". If you are interested in data preprocessing, take a look at level 2.
+If one wants to skip the explanation then the full script is available at the end of this level.
+
+We will train a Multinomial Logistic Regression model on the training dataset and subsequently we will use the test dataset
+to assess how well our model can predict if the income is above or below $50K/yr based on the features.
+
+Step 1: Load and prepare data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+First, we get our training and testing data from the built-in DataManager. Since the multiLogReg function requires the
+labels (Y) to be > 0, we add 1 to all labels. This ensures that the smallest label is >= 1. Additionally we will only take
+a fraction of the training and test set into account to speed up the execution.
+
+.. code-block:: python
+
+    from systemds.context import SystemDSContext
+    from systemds.examples.tutorials.adult import DataManager
+
+    sds = SystemDSContext()

Review comment:
       Small change here:
   I would like it if you used

   with SystemDSContext() as sds:

   and then indented the block.
   This is because you then guarantee that the sds context is closed after use.

   See the first example at https://apache.github.io/systemds/api/python/getting_started/simple_examples.html
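
   A minimal sketch of the suggested pattern, reusing only imports and calls already shown in level 1 of the tutorial (the sample size of 15000 is just illustrative):

   from systemds.context import SystemDSContext
   from systemds.examples.tutorials.adult import DataManager

   with SystemDSContext() as sds:
       d = DataManager()
       train_data, train_labels, test_data, test_labels = d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
       # Train data (labels shifted to be >= 1 for multiLogReg)
       X = sds.from_numpy(train_data[:15000])
       Y = sds.from_numpy(train_labels[:15000])
       Y = Y + 1.0
       # ... rest of the pipeline unchanged ...
   # leaving the with-block closes the SystemDS context automatically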

##########
File path: src/main/python/docs/source/guide/python_end_to_end_tut.rst
##########
@@ -0,0 +1,561 @@
+.. -------------------------------------------------------------
+.. 
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+.. 
+..   http://www.apache.org/licenses/LICENSE-2.0
+.. 
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+.. 
+.. ------------------------------------------------------------
+
+Python end-to-end tutorial
+==========================
+
+The goal of this tutorial is to showcase different features of the SystemDS framework that can be accessed with the Python API.
+For this, we want to use the `Adult <https://archive.ics.uci.edu/ml/datasets/adult/>`_ dataset and predict whether the income of a person exceeds $50K/yr based on census data.
+The Adult dataset contains attributes like age, workclass, education, marital-status, occupation, race, [...] and the labels >50K or <=50K.
+Most of these features are categorical string values, but the dataset also includes continuous features.
+For this, we define three different levels with an increasing level of detail with regard to features provided by SystemDS.
+In the first level, we simply get an already preprocessed dataset from the built-in DataManager.
+The second level shows the built-in preprocessing capabilities of SystemDS.
+With the third level, we want to show how we can integrate custom-built networks or algorithms into our Python program.
+
+Prerequisite: 
+
+- :doc:`/getting_started/install`
+
+Level 1
+-------
+
+This example shows how one can work with NumPy data within the SystemDS framework. More precisely, we will make use of the
+built-in DataManager, Multinomial Logistic Regression function, and the Confusion Matrix function. The dataset used in this
+tutorial is a preprocessed version of the "UCI Adult Data Set". If you are interested in data preprocessing, take a look at level 2.
+If one wants to skip the explanation then the full script is available at the end of this level.
+
+We will train a Multinomial Logistic Regression model on the training dataset and subsequently we will use the test dataset
+to assess how well our model can predict if the income is above or below $50K/yr based on the features.
+
+Step 1: Load and prepare data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+First, we get our training and testing data from the built-in DataManager. Since the multiLogReg function requires the
+labels (Y) to be > 0, we add 1 to all labels. This ensures that the smallest label is >= 1. Additionally we will only take
+a fraction of the training and test set into account to speed up the execution.
+
+.. code-block:: python
+
+    from systemds.context import SystemDSContext
+    from systemds.examples.tutorials.adult import DataManager
+
+    sds = SystemDSContext()
+    d = DataManager()
+
+    # limit the sample size
+    train_count = 15000
+    test_count = 5000
+
+    train_data, train_labels, test_data, test_labels = d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+    # Train data
+    X = sds.from_numpy(train_data[:train_count])
+    Y = sds.from_numpy(train_labels[:train_count])
+    Y = Y + 1.0
+
+    # Test data
+    Xt = sds.from_numpy(test_data[:test_count])
+    Yt = sds.from_numpy(test_labels[:test_count])
+    Yt = Yt + 1.0
+
+Here the DataManager contains the code for downloading and setting up NumPy arrays containing the data.
+It is noteworthy that the function get_preprocessed_dataset has options for basic standardization, interpolation, and combining categorical features inside one column whose occurrences are below a certain threshold.
+
+Step 2: Training
+~~~~~~~~~~~~~~~~
+
+Now that we prepared the data, we can use the multiLogReg function. First, we will train the model on our
+training data. Afterward, we can make predictions on the test data and assess the performance of the model.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import multiLogReg
+    betas = multiLogReg(X, Y)
+
+Note that nothing has been calculated yet. In SystemDS the calculation is executed once compute() is called.
+E.g. betas_res = betas.compute().
+
+We can now use the trained model to make predictions on the test data.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import multiLogRegPredict
+    [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt)
+
+The multiLogRegPredict function has three return values:
+    - m, a matrix with the mean probability of correctly classifying each label. We do not use it further in this example.
+    - y_pred, is the predictions made using the model
+    - acc, is the accuracy achieved by the model.
+
+Step 3: Confusion Matrix
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+A confusion matrix is a useful tool to analyze the performance of the model and to obtain a better understanding of
+which classes the model has difficulty separating.
+The confusionMatrix function takes the predicted labels and the true labels. It then returns the confusion matrix
+for the predictions and the confusion matrix averages of each true class.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import confusionMatrix
+    confusion_matrix_abs, _ = confusionMatrix(y_pred, Yt).compute()
+    print(confusion_matrix_abs)
+
+Full Script
+~~~~~~~~~~~
+
+In the full script, some steps are combined to shorten the overall script.
+
+.. code-block:: python
+
+    import numpy as np
+    from systemds.context import SystemDSContext
+    from systemds.examples.tutorials.adult import DataManager
+    from systemds.operator.algorithm import multiLogReg, multiLogRegPredict, confusionMatrix
+
+    sds = SystemDSContext()
+    d = DataManager()
+
+    # limit the sample size
+    train_count = 15000
+    test_count = 5000
+
+    train_data, train_labels, test_data, test_labels = d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+    # Train data
+    X = sds.from_numpy(train_data[:train_count])
+    Y = sds.from_numpy(train_labels[:train_count])
+    Y = Y + 1.0
+
+    # Test data
+    Xt = sds.from_numpy(test_data[:test_count])
+    Yt = sds.from_numpy(test_labels[:test_count])
+    Yt = Yt + 1.0
+
+    betas = multiLogReg(X, Y)
+    [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt)
+
+    confusion_matrix_abs, _ = confusionMatrix(y_pred, Yt).compute()
+    print(confusion_matrix_abs)
+
+Level 2
+-------
+
+This part of the tutorial shows an overview of the preprocessing capabilities that SystemDS has to offer.
+We will take an unprocessed dataset in csv format, read it with SystemDS, and then do the heavy lifting for the preprocessing with SystemDS.
+As mentioned before, we want to use the Adult dataset for this task.
+If one wants to skip the explanation, then the full script is available at the end of this level.
+
+Step 1: Metadata and reading
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+First of all, we need to download the dataset and create a mtd-file for specifying different metadata about the dataset.
+We download the train and test dataset from: https://archive.ics.uci.edu/ml/datasets/adult
+
+The downloaded dataset will be slightly modified for convenience. These modifications entail removing unnecessary newlines at the end of the files and
+adding column names at the top of the files such that the first line looks like:
+
+.. code-block::
+
+    age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
+
+We also delete the line holding the string value |1x3 Cross validator inside the test dataset.
+
+After these modifications, we have to define a mtd file for each file we want to read. This mtd file has to be in the same directory as the dataset.
+In this particular example, the dataset is split into two files "train_data.csv" and "test_data.csv". We want to read both, which means that we will define a mtd file for
+each of them. Those files will be called "train_data.csv.mtd" and "test_data.csv.mtd".
+In these files, we can define certain properties that the file has and also specify which values are supposed to get treated like missing values.
+
+The content of the train_data.csv.mtd file is:
+
+.. code-block::
+
+    {
+    "data_type": "frame",
+    "format": "csv",
+    "header": true,
+    "naStrings": [ "?", "" ],
+    "rows": 32561,
+    "cols": 15
+    }
+
+The "format" of the file is csv, and "header" is set to true because we added the feature names as headers to the csv files.
+The value "data_type" is set to frame, as the preprocessing functions that we use require this datatype.
+The value of "naStrings" is a list of all the string values that should be treated as unknown values during the preprocessing.
+Also, "rows" in our example is set to 32561, as we have this many entries and "cols" is set to 15 as we have 14 features, and one label column inside the files. We will later show how we can split them.
+
+After these requirements are completed, we have to define a SystemDSContext for reading our dataset. We can do this in the following way:
+
+.. code-block:: python
+
+    sds = SystemDSContext()
+
+    train_count = 32561
+    test_count = 16281
+
+With this context we can now define a read operation using the path of the dataset and a schema.
+The schema simply defines the data types for each column.
+
+As already mentioned, SystemDS supports lazy execution by default, which means that the read operation is only executed after calling the compute() function.
+
+.. code-block:: python
+
+    SCHEMA = '"DOUBLE,STRING,DOUBLE,STRING,DOUBLE,STRING,STRING,STRING,STRING,STRING,DOUBLE,DOUBLE,DOUBLE,STRING,STRING"'
+
+    dataset_path_train = "adult/train_data.csv"
+    dataset_path_test = "adult/test_data.csv"
+
+    F1 = sds.read(
+        dataset_path_train,
+        schema=SCHEMA
+    )
+    F2 = sds.read(
+        dataset_path_test,
+        schema=SCHEMA
+    )
+
+Step 2: Defining preprocess operations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Now that the read operation has been declared, we can define an additional file for the further preprocessing of the dataset.
+For this, we create a .json file that holds information about the operations that will be performed on individual columns.
+For the sake of this tutorial we will use the file "jspec.json" with the following content:
+
+.. code-block::
+
+    {
+    "impute":
+    [ { "name": "age", "method": "global_mean" }
+     ,{ "name": "workclass" , "method": "global_mode" }
+     ,{ "name": "fnlwgt", "method": "global_mean" }
+     ,{ "name": "education", "method": "global_mode"  }
+     ,{ "name": "education-num", "method": "global_mean" }
+     ,{ "name": "marital-status"      , "method": "global_mode" }
+     ,{ "name": "occupation"        , "method": "global_mode" }
+     ,{ "name": "relationship" , "method": "global_mode" }
+     ,{ "name": "race"        , "method": "global_mode" }
+     ,{ "name": "sex"        , "method": "global_mode" }
+     ,{ "name": "capital-gain", "method": "global_mean" }
+     ,{ "name": "capital-loss", "method": "global_mean" }
+     ,{ "name": "hours-per-week", "method": "global_mean" }
+     ,{ "name": "native-country"        , "method": "global_mode" }
+    ],
+    "bin": [ { "name": "age"  , "method": "equi-width", "numbins": 3 }],
+    "dummycode": ["age", "workclass", "education", "marital-status", "occupation", "relationship", "race", "sex", "native-country"],
+    "recode": ["income"]
+    }
+
+Our dataset has missing values. An easy way to deal with that circumstance is to use the "impute" option that SystemDS supports.
+We simply pass a list that holds all the relations between column names and the method of interpolation. A more specific example is the "education" column.
+In the dataset certain entries have missing values for this column. As this is a string feature,
+we can simply define the method as "global_mode" and replace every missing value with the global mode inside this column. It is important to note that
+we first had to define the values of the missing strings in our selected dataset using the .mtd files ("naStrings": [ "?", "" ]).
+
+With the "bin" keyword we can discretize continuous values into a small number of bins. Here the column with age values
+is discretized into three age intervals. The only method that is currently supported is equi-width binning.
+
+The column-level data transformation "dummycode" allows us to one-hot-encode a column.
+In our example we first bin the "age" column into 3 different bins. This means that we now have one column where one entry can belong to one of 3 age groups. After using
+"dummycode", we transform this one column into 3 different columns, one for each bin.
+
+At last, we make use of the "recode" transformation for categorical columns; it maps all distinct categories in
+the column into consecutive numbers, starting from 1. In our example we recode the "income" column, which
+transforms it from "<=50K" and ">50K" to "1" and "2" respectively.
+
+Another good resource for further ways of processing is: https://apache.github.io/systemds/site/dml-language-reference.html
+
+There we provide different examples for defining jspec's and what functionality is currently supported.
+
+After defining the .jspec file we can read it by passing the filepath, data_type and value_type using the following command:
+
+.. code-block:: python
+
+    dataset_jspec = "adult/jspec.json"
+    jspec = sds.read(dataset_jspec, data_type="scalar", value_type="string")
+
+Finally, we need to define a custom dml file to split the features from the labels and replace certain values, which we will need later.
+We will call this file "preprocess.dml":
+
+.. code-block::
+
+    get_X = function(matrix[double] X, int start, int stop)
+    return (matrix[double] returnVal) {

Review comment:
       I would indent this line and the next.

##########
File path: src/main/python/docs/source/guide/python_end_to_end_tut.rst
##########
@@ -0,0 +1,561 @@
+.. -------------------------------------------------------------
+.. 
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+.. 
+..   http://www.apache.org/licenses/LICENSE-2.0
+.. 
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+.. 
+.. ------------------------------------------------------------
+
+Python end-to-end tutorial
+==========================
+
+The goal of this tutorial is to showcase different features of the SystemDS framework that can be accessed with the Python API.
+For this, we want to use the `Adult <https://archive.ics.uci.edu/ml/datasets/adult/>`_ dataset and predict whether the income of a person exceeds $50K/yr based on census data.
+The Adult dataset contains attributes like age, workclass, education, marital-status, occupation, race, [...] and the labels >50K or <=50K.
+Most of these features are categorical string values, but the dataset also includes continuous features.
+For this, we define three different levels with an increasing level of detail with regard to features provided by SystemDS.
+In the first level, we simply get an already preprocessed dataset from the built-in DataManager.
+The second level shows the built-in preprocessing capabilities of SystemDS.
+With the third level, we want to show how we can integrate custom-built networks or algorithms into our Python program.
+
+Prerequisite: 
+
+- :doc:`/getting_started/install`
+
+Level 1
+-------
+
+This example shows how one can work with NumPy data within the SystemDS framework. More precisely, we will make use of the
+built-in DataManager, Multinomial Logistic Regression function, and the Confusion Matrix function. The dataset used in this
+tutorial is a preprocessed version of the "UCI Adult Data Set". If you are interested in data preprocessing, take a look at level 2.
+If one wants to skip the explanation then the full script is available at the end of this level.
+
+We will train a Multinomial Logistic Regression model on the training dataset and subsequently we will use the test dataset
+to assess how well our model can predict if the income is above or below $50K/yr based on the features.
+
+Step 1: Load and prepare data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+First, we get our training and testing data from the built-in DataManager. Since the multiLogReg function requires the
+labels (Y) to be > 0, we add 1 to all labels. This ensures that the smallest label is >= 1. Additionally we will only take
+a fraction of the training and test set into account to speed up the execution.
+
+.. code-block:: python
+
+    from systemds.context import SystemDSContext
+    from systemds.examples.tutorials.adult import DataManager
+
+    sds = SystemDSContext()
+    d = DataManager()
+
+    # limit the sample size
+    train_count = 15000
+    test_count = 5000
+
+    train_data, train_labels, test_data, test_labels = d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+    # Train data
+    X = sds.from_numpy(train_data[:train_count])
+    Y = sds.from_numpy(train_labels[:train_count])
+    Y = Y + 1.0
+
+    # Test data
+    Xt = sds.from_numpy(test_data[:test_count])
+    Yt = sds.from_numpy(test_labels[:test_count])
+    Yt = Yt + 1.0
+
+Here the DataManager contains the code for downloading and setting up NumPy arrays containing the data.
+It is noteworthy that the function get_preprocessed_dataset has options for basic standardization, interpolation, and combining categorical features inside one column whose occurrences are below a certain threshold.
+
+Step 2: Training
+~~~~~~~~~~~~~~~~
+
+Now that we prepared the data, we can use the multiLogReg function. First, we will train the model on our
+training data. Afterward, we can make predictions on the test data and assess the performance of the model.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import multiLogReg
+    betas = multiLogReg(X, Y)
+
+Note that nothing has been calculated yet. In SystemDS the calculation is executed once compute() is called.
+E.g. betas_res = betas.compute().
+
+We can now use the trained model to make predictions on the test data.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import multiLogRegPredict
+    [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt)
+
+The multiLogRegPredict function has three return values:
+    - m, a matrix with the mean probability of correctly classifying each label. We do not use it further in this example.
+    - y_pred, is the predictions made using the model
+    - acc, is the accuracy achieved by the model.
+
+Step 3: Confusion Matrix
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+A confusion matrix is a useful tool to analyze the performance of the model and to obtain a better understanding of
+which classes the model has difficulty separating.
+The confusionMatrix function takes the predicted labels and the true labels. It then returns the confusion matrix
+for the predictions and the confusion matrix averages of each true class.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import confusionMatrix
+    confusion_matrix_abs, _ = confusionMatrix(y_pred, Yt).compute()
+    print(confusion_matrix_abs)
+
+Full Script
+~~~~~~~~~~~
+
+In the full script, some steps are combined to shorten the overall script.
+
+.. code-block:: python
+
+    import numpy as np
+    from systemds.context import SystemDSContext
+    from systemds.examples.tutorials.adult import DataManager
+    from systemds.operator.algorithm import multiLogReg, multiLogRegPredict, confusionMatrix
+
+    sds = SystemDSContext()

Review comment:
       Here again:
   
   with SystemDSContext() as sds:

##########
File path: src/main/python/docs/source/guide/python_end_to_end_tut.rst
##########
@@ -0,0 +1,561 @@
+.. -------------------------------------------------------------
+.. 
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+.. 
+..   http://www.apache.org/licenses/LICENSE-2.0
+.. 
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+.. 
+.. ------------------------------------------------------------
+
+Python end-to-end tutorial
+==========================
+
+The goal of this tutorial is to showcase different features of the SystemDS framework that can be accessed with the Python API.
+For this, we want to use the `Adult <https://archive.ics.uci.edu/ml/datasets/adult/>`_ dataset and predict whether the income of a person exceeds $50K/yr based on census data.
+The Adult dataset contains attributes like age, workclass, education, marital-status, occupation, race, [...] and the labels >50K or <=50K.
+Most of these features are categorical string values, but the dataset also includes continuous features.
+For this, we define three different levels with an increasing level of detail with regard to features provided by SystemDS.
+In the first level, we simply get an already preprocessed dataset from the built-in DataManager.
+The second level shows the built-in preprocessing capabilities of SystemDS.
+With the third level, we want to show how we can integrate custom-built networks or algorithms into our Python program.
+
+Prerequisite: 
+
+- :doc:`/getting_started/install`
+
+Level 1
+-------
+
+This example shows how one can work with NumPy data within the SystemDS framework. More precisely, we will make use of the
+built-in DataManager, Multinomial Logistic Regression function, and the Confusion Matrix function. The dataset used in this
+tutorial is a preprocessed version of the "UCI Adult Data Set". If you are interested in data preprocessing, take a look at level 2.
+If one wants to skip the explanation then the full script is available at the end of this level.
+
+We will train a Multinomial Logistic Regression model on the training dataset and subsequently we will use the test dataset
+to assess how well our model can predict if the income is above or below $50K/yr based on the features.
+
+Step 1: Load and prepare data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+First, we get our training and testing data from the built-in DataManager. Since the multiLogReg function requires the
+labels (Y) to be > 0, we add 1 to all labels. This ensures that the smallest label is >= 1. Additionally we will only take
+a fraction of the training and test set into account to speed up the execution.
+
+.. code-block:: python
+
+    from systemds.context import SystemDSContext
+    from systemds.examples.tutorials.adult import DataManager
+
+    sds = SystemDSContext()
+    d = DataManager()
+
+    # limit the sample size
+    train_count = 15000
+    test_count = 5000
+
+    train_data, train_labels, test_data, test_labels = d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+    # Train data
+    X = sds.from_numpy(train_data[:train_count])
+    Y = sds.from_numpy(train_labels[:train_count])
+    Y = Y + 1.0
+
+    # Test data
+    Xt = sds.from_numpy(test_data[:test_count])
+    Yt = sds.from_numpy(test_labels[:test_count])
+    Yt = Yt + 1.0
+
+Here the DataManager contains the code for downloading and setting up NumPy arrays containing the data.
+It is noteworthy that the function get_preprocessed_dataset has options for basic standardization, interpolation, and combining categorical features inside one column whose occurrences are below a certain threshold.
+
+Step 2: Training
+~~~~~~~~~~~~~~~~
+
+Now that we prepared the data, we can use the multiLogReg function. First, we will train the model on our
+training data. Afterward, we can make predictions on the test data and assess the performance of the model.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import multiLogReg
+    betas = multiLogReg(X, Y)
+
+Note that nothing has been calculated yet. In SystemDS the calculation is executed once compute() is called.
+E.g. betas_res = betas.compute().
+
+We can now use the trained model to make predictions on the test data.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import multiLogRegPredict
+    [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt)

Review comment:
       I don't think the brackets around [_, y_pred, acc] are needed here.
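
   For example, the same call with plain tuple unpacking (nothing else changes):

   _, y_pred, acc = multiLogRegPredict(Xt, betas, Yt)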
   

##########
File path: src/main/python/docs/source/guide/python_end_to_end_tut.rst
##########
@@ -0,0 +1,561 @@
+.. -------------------------------------------------------------
+.. 
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+.. 
+..   http://www.apache.org/licenses/LICENSE-2.0
+.. 
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+.. 
+.. ------------------------------------------------------------
+
+Python end-to-end tutorial
+==========================
+
+The goal of this tutorial is to showcase different features of the SystemDS framework that can be accessed with the Python API.
+For this, we want to use the `Adult <https://archive.ics.uci.edu/ml/datasets/adult/>`_ dataset and predict whether the income of a person exceeds $50K/yr based on census data.
+The Adult dataset contains attributes like age, workclass, education, marital-status, occupation, race, [...] and the labels >50K or <=50K.
+Most of these features are categorical string values, but the dataset also includes continuous features.
+For this, we define three different levels with an increasing level of detail with regard to features provided by SystemDS.
+In the first level, we simply get an already preprocessed dataset from the built-in DataManager.
+The second level shows the built-in preprocessing capabilities of SystemDS.
+With the third level, we want to show how we can integrate custom-built networks or algorithms into our Python program.
+
+Prerequisite: 
+
+- :doc:`/getting_started/install`
+
+Level 1
+-------
+
+This example shows how one can work with NumPy data within the SystemDS framework. More precisely, we will make use of the
+built-in DataManager, Multinomial Logistic Regression function, and the Confusion Matrix function. The dataset used in this
+tutorial is a preprocessed version of the "UCI Adult Data Set". If you are interested in data preprocessing, take a look at level 2.
+If one wants to skip the explanation then the full script is available at the end of this level.
+
+We will train a Multinomial Logistic Regression model on the training dataset and subsequently we will use the test dataset
+to assess how well our model can predict if the income is above or below $50K/yr based on the features.
+
+Step 1: Load and prepare data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+First, we get our training and testing data from the built-in DataManager. Since the multiLogReg function requires the
+labels (Y) to be > 0, we add 1 to all labels. This ensures that the smallest label is >= 1. Additionally we will only take
+a fraction of the training and test set into account to speed up the execution.
+
+.. code-block:: python
+
+    from systemds.context import SystemDSContext
+    from systemds.examples.tutorials.adult import DataManager
+
+    sds = SystemDSContext()
+    d = DataManager()
+
+    # limit the sample size
+    train_count = 15000
+    test_count = 5000
+
+    train_data, train_labels, test_data, test_labels = d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+    # Train data
+    X = sds.from_numpy(train_data[:train_count])
+    Y = sds.from_numpy(train_labels[:train_count])
+    Y = Y + 1.0
+
+    # Test data
+    Xt = sds.from_numpy(test_data[:test_count])
+    Yt = sds.from_numpy(test_labels[:test_count])
+    Yt = Yt + 1.0
+
+Here the DataManager contains the code for downloading and setting up NumPy arrays containing the data.
+It is noteworthy that the function get_preprocessed_dataset has options for basic standardization, interpolation, and combining categorical features inside one column whose occurrences are below a certain threshold.
+
+Step 2: Training
+~~~~~~~~~~~~~~~~
+
+Now that we prepared the data, we can use the multiLogReg function. First, we will train the model on our
+training data. Afterward, we can make predictions on the test data and assess the performance of the model.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import multiLogReg
+    betas = multiLogReg(X, Y)
+
+Note that nothing has been calculated yet. In SystemDS the calculation is executed once compute() is called.
+E.g. betas_res = betas.compute().
+
+We can now use the trained model to make predictions on the test data.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import multiLogRegPredict
+    [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt)
+
+The multiLogRegPredict function has three return values:
+    - m, a matrix with the mean probability of correctly classifying each label. We do not use it further in this example.
+    - y_pred, is the predictions made using the model
+    - acc, is the accuracy achieved by the model.
+
+Step 3: Confusion Matrix
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+A confusion matrix is a useful tool to analyze the performance of the model and to obtain a better understanding of
+which classes the model has difficulty separating.
+The confusionMatrix function takes the predicted labels and the true labels. It then returns the confusion matrix
+for the predictions and the confusion matrix averages of each true class.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import confusionMatrix
+    confusion_matrix_abs, _ = confusionMatrix(y_pred, Yt).compute()
+    print(confusion_matrix_abs)
+
+Full Script
+~~~~~~~~~~~
+
+In the full script, some steps are combined to shorten the overall script.
+
+.. code-block:: python
+
+    import numpy as np
+    from systemds.context import SystemDSContext
+    from systemds.examples.tutorials.adult import DataManager
+    from systemds.operator.algorithm import multiLogReg, multiLogRegPredict, confusionMatrix
+
+    sds = SystemDSContext()
+    d = DataManager()
+
+    # limit the sample size
+    train_count = 15000
+    test_count = 5000
+
+    train_data, train_labels, test_data, test_labels = d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+    # Train data
+    X = sds.from_numpy(train_data[:train_count])
+    Y = sds.from_numpy(train_labels[:train_count])
+    Y = Y + 1.0
+
+    # Test data
+    Xt = sds.from_numpy(test_data[:test_count])
+    Yt = sds.from_numpy(test_labels[:test_count])
+    Yt = Yt + 1.0
+
+    betas = multiLogReg(X, Y)
+    [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt)
+
+    confusion_matrix_abs, _ = confusionMatrix(y_pred, Yt).compute()
+    print(confusion_matrix_abs)
+
+Level 2
+-------
+
+This part of the tutorial shows an overview of the preprocessing capabilities that SystemDS has to offer.
+We will take an unprocessed dataset in csv format, read it with SystemDS, and then do the heavy lifting for the preprocessing with SystemDS.
+As mentioned before, we want to use the Adult dataset for this task.
+If one wants to skip the explanation, then the full script is available at the end of this level.
+
+Step 1: Metadata and reading
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+First of all, we need to download the dataset and create a mtd-file for specifying different metadata about the dataset.
+We download the train and test dataset from: https://archive.ics.uci.edu/ml/datasets/adult
+
+The downloaded dataset will be slightly modified for convenience. These modifications entail removing unnecessary newlines at the end of the files and
+adding column names at the top of the files such that the first line looks like:
+
+.. code-block::
+
+    age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
+
+We also delete the line holding the string value |1x3 Cross validator inside the test dataset.
+
+After these modifications, we have to define a mtd file for each file we want to read. This mtd file has to be in the same directory as the dataset.
+In this particular example, the dataset is split into two files "train_data.csv" and "test_data.csv". We want to read both, which means that we will define a mtd file for
+each of them. Those files will be called "train_data.csv.mtd" and "test_data.csv.mtd".
+In these files, we can define certain properties that the file has and also specify which values are supposed to get treated like missing values.
+
+The content of the train_data.csv.mtd file is:
+
+.. code-block::
+
+    {
+    "data_type": "frame",
+    "format": "csv",
+    "header": true,
+    "naStrings": [ "?", "" ],
+    "rows": 32561,
+    "cols": 15
+    }
+
+The "format" of the file is csv, and "header" is set to true because we added the feature names as headers to the csv files.
+The value "data_type" is set to frame, as the preprocessing functions that we use require this datatype.
+The value of "naStrings" is a list of all the string values that should be treated as unknown values during the preprocessing.
+Also, "rows" in our example is set to 32561, as we have this many entries and "cols" is set to 15 as we have 14 features, and one label column inside the files. We will later show how we can split them.
+
+After these requirements are completed, we have to define a SystemDSContext for reading our dataset. We can do this in the following way:
+
+.. code-block:: python
+
+    sds = SystemDSContext()
+
+    train_count = 32561
+    test_count = 16281
+
+With this context we can now define a read operation using the path of the dataset and a schema.
+The schema simply defines the data types for each column.
+
+As already mentioned, SystemDS supports lazy execution by default, which means that the read operation is only executed after calling the compute() function.
+
+.. code-block:: python
+
+    SCHEMA = '"DOUBLE,STRING,DOUBLE,STRING,DOUBLE,STRING,STRING,STRING,STRING,STRING,DOUBLE,DOUBLE,DOUBLE,STRING,STRING"'
+
+    dataset_path_train = "adult/train_data.csv"
+    dataset_path_test = "adult/test_data.csv"
+
+    F1 = sds.read(
+        dataset_path_train,
+        schema=SCHEMA
+    )
+    F2 = sds.read(
+        dataset_path_test,
+        schema=SCHEMA
+    )
+
+Step 2: Defining preprocess operations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Now that the read operation has been declared, we can define an additional file for the further preprocessing of the dataset.
+For this, we create a .json file that holds information about the operations that will be performed on individual columns.
+For the sake of this tutorial we will use the file "jspec.json" with the following content:
+
+.. code-block::
+
+    {
+    "impute":
+    [ { "name": "age", "method": "global_mean" }
+     ,{ "name": "workclass" , "method": "global_mode" }
+     ,{ "name": "fnlwgt", "method": "global_mean" }
+     ,{ "name": "education", "method": "global_mode"  }
+     ,{ "name": "education-num", "method": "global_mean" }
+     ,{ "name": "marital-status"      , "method": "global_mode" }
+     ,{ "name": "occupation"        , "method": "global_mode" }
+     ,{ "name": "relationship" , "method": "global_mode" }
+     ,{ "name": "race"        , "method": "global_mode" }
+     ,{ "name": "sex"        , "method": "global_mode" }
+     ,{ "name": "capital-gain", "method": "global_mean" }
+     ,{ "name": "capital-loss", "method": "global_mean" }
+     ,{ "name": "hours-per-week", "method": "global_mean" }
+     ,{ "name": "native-country"        , "method": "global_mode" }
+    ],
+    "bin": [ { "name": "age"  , "method": "equi-width", "numbins": 3 }],
+    "dummycode": ["age", "workclass", "education", "marital-status", "occupation", "relationship", "race", "sex", "native-country"],
+    "recode": ["income"]
+    }
+
+Our dataset has missing values. An easy way to deal with that circumstance is to use the "impute" option that SystemDS supports.
+We simply pass a list that holds all the relations between column names and the method of interpolation. A more specific example is the "education" column.
+In the dataset certain entries have missing values for this column. As this is a string feature,
+we can simply define the method as "global_mode" and replace every missing value with the global mode inside this column. It is important to note that
+we first had to define the values of the missing strings in our selected dataset using the .mtd files ("naStrings": [ "?", "" ]).
+
+With the "bin" keyword we can discretize continuous values into a small number of bins. Here the column with age values
+is discretized into three age intervals. The only method that is currently supported is equi-width binning.
+
+The column-level data transformation "dummycode" allows us to one-hot-encode a column.
+In our example we first bin the "age" column into 3 different bins. This means that we now have one column where one entry can belong to one of 3 age groups. After using
+"dummycode", we transform this one column into 3 different columns, one for each bin.
+
+At last, we make use of the "recode" transformation for categorical columns; it maps all distinct categories in
+the column into consecutive numbers, starting from 1. In our example we recode the "income" column, which
+transforms it from "<=50K" and ">50K" to "1" and "2" respectively.
+
+Another good resource for further ways of processing is: https://apache.github.io/systemds/site/dml-language-reference.html
+
+There we provide different examples for defining jspec's and what functionality is currently supported.
+
+After defining the .jspec file we can read it by passing the filepath, data_type and value_type using the following command:
+
+.. code-block:: python
+
+    dataset_jspec = "adult/jspec.json"
+    jspec = sds.read(dataset_jspec, data_type="scalar", value_type="string")
+
+Finally, we need to define a custom dml file to split the features from the labels and replace certain values, which we will need later.
+We will call this file "preprocess.dml":
+
+.. code-block::
+
+    get_X = function(matrix[double] X, int start, int stop)
+    return (matrix[double] returnVal) {
+    returnVal = X[start:stop,1:ncol(X)-1]
+    }
+    get_Y = function(matrix[double] X, int start, int stop)
+    return (matrix[double] returnVal) {
+    returnVal = X[start:stop,ncol(X):ncol(X)]

Review comment:
       indent

##########
File path: src/main/python/docs/source/guide/python_end_to_end_tut.rst
##########
@@ -0,0 +1,561 @@
+.. -------------------------------------------------------------
+.. 
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+.. 
+..   http://www.apache.org/licenses/LICENSE-2.0
+.. 
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+.. 
+.. ------------------------------------------------------------
+
+Python end-to-end tutorial
+==========================
+
+The goal of this tutorial is to showcase different features of the SystemDS framework that can be accessed with the Python API.
+For this, we want to use the `Adult <https://archive.ics.uci.edu/ml/datasets/adult/>`_ dataset and predict whether the income of a person exceeds $50K/yr based on census data.
+The Adult dataset contains attributes like age, workclass, education, marital-status, occupation, race, [...] and the labels >50K or <=50K.
+Most of these features are categorical string values, but the dataset also includes continuous features.
+For this, we define three different levels with an increasing level of detail with regard to features provided by SystemDS.
+In the first level, we simply get an already preprocessed dataset from the built-in DataManager.
+The second level shows the built-in preprocessing capabilities of SystemDS.
+With the third level, we want to show how we can integrate custom-built networks or algorithms into our Python program.
+
+Prerequisite: 
+
+- :doc:`/getting_started/install`
+
+Level 1
+-------
+
+This example shows how one can work with NumPy data within the SystemDS framework. More precisely, we will make use of the
+built-in DataManager, Multinomial Logistic Regression function, and the Confusion Matrix function. The dataset used in this
+tutorial is a preprocessed version of the "UCI Adult Data Set". If you are interested in data preprocessing, take a look at level 2.
+If one wants to skip the explanation then the full script is available at the end of this level.
+
+We will train a Multinomial Logistic Regression model on the training dataset and subsequently we will use the test dataset
+to assess how well our model can predict if the income is above or below $50K/yr based on the features.
+
+Step 1: Load and prepare data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+First, we get our training and testing data from the built-in DataManager. Since the multiLogReg function requires the
+labels (Y) to be > 0, we add 1 to all labels. This ensures that the smallest label is >= 1. Additionally we will only take
+a fraction of the training and test set into account to speed up the execution.
+
+.. code-block:: python
+
+    from systemds.context import SystemDSContext
+    from systemds.examples.tutorials.adult import DataManager
+
+    sds = SystemDSContext()
+    d = DataManager()
+
+    # limit the sample size
+    train_count = 15000
+    test_count = 5000
+
+    train_data, train_labels, test_data, test_labels = d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+    # Train data
+    X = sds.from_numpy(train_data[:train_count])
+    Y = sds.from_numpy(train_labels[:train_count])
+    Y = Y + 1.0
+
+    # Test data
+    Xt = sds.from_numpy(test_data[:test_count])
+    Yt = sds.from_numpy(test_labels[:test_count])
+    Yt = Yt + 1.0
+
+Here the DataManager contains the code for downloading and setting up NumPy arrays containing the data.
+It is noteworthy that the function get_preprocessed_dataset has options for basic standardization, interpolation, and combining categorical features inside one column whose occurrences are below a certain threshold.
+
+Step 2: Training
+~~~~~~~~~~~~~~~~
+
+Now that we prepared the data, we can use the multiLogReg function. First, we will train the model on our
+training data. Afterward, we can make predictions on the test data and assess the performance of the model.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import multiLogReg
+    betas = multiLogReg(X, Y)
+
+Note that nothing has been calculated yet. In SystemDS the calculation is executed once compute() is called.
+E.g. betas_res = betas.compute().
+
+We can now use the trained model to make predictions on the test data.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import multiLogRegPredict
+    [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt)
+
+The multiLogRegPredict function has three return values:
+    - m, a matrix with the mean probability of correctly classifying each label. We do not use it further in this example.
+    - y_pred, is the predictions made using the model
+    - acc, is the accuracy achieved by the model.
+
+Step 3: Confusion Matrix
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+A confusion matrix is a useful tool to analyze the performance of the model and to obtain a better understanding of
+which classes the model has difficulty separating.
+The confusionMatrix function takes the predicted labels and the true labels. It then returns the confusion matrix
+for the predictions and the confusion matrix averages of each true class.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import confusionMatrix
+    confusion_matrix_abs, _ = confusionMatrix(y_pred, Yt).compute()
+    print(confusion_matrix_abs)

Review comment:
       maybe include what it prints in the guide.

##########
File path: src/main/python/tests/examples/tutorials/preprocess.dml
##########
@@ -0,0 +1,46 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+get_X = function(matrix[double] X,
+                 int start, int stop)
+    return (matrix[double] returnVal) {
+  returnVal = X[start:stop,1:ncol(X)-1]
+}
+get_Y = function(matrix[double] X,
+                 int start, int stop)
+    return (matrix[double] returnVal) {
+  returnVal = X[start:stop,ncol(X):ncol(X)]
+}
+
+replace_value = function(matrix[double] X,
+                 double pattern , double replacement)
+    return (matrix[double] returnVal) {
+  returnVal = replace(target=X, pattern=pattern, replacement=replacement)
+}
+
+#replace_target_frame = function(String replacement, String to_replace, Frame[Unknown] X)
+#  return(Frame[Unknown] X)
+#{
+#  for (i in 1:nrow(X)) {
+#    if (as.scalar(X[i, ncol(X)]) == to_replace) {
+#      X[i, ncol(X)] = replacement;
+#    }
+#  }
+#}

Review comment:
       add newline

##########
File path: src/main/python/tests/examples/tutorials/neural_net_source.dml
##########
@@ -0,0 +1,217 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+# Imports
+source("nn/layers/affine.dml") as affine
+source("nn/layers/logcosh_loss.dml") as logcosh
+source("nn/layers/elu.dml") as elu
+source("nn/layers/sigmoid.dml") as sigmoid
+source("nn/optim/sgd.dml") as sgd
+
+init_model = function(Integer inputDimension, Integer outputDimension, int seed = -1)
+  return(list[unknown] model){
+  [W1, b1] = affine::init(inputDimension, 200, seed = seed)
+  lseed = ifelse(seed==-1, -1, seed + 1);
+  [W2, b2] = affine::init(200, 200,  seed = lseed)
+  lseed = ifelse(seed==-1, -1, seed + 2);
+  [W3, b3] = affine::init(200, outputDimension, seed = lseed)
+  model = list(W1, W2, W3, b1, b2, b3)
+}
+
+
+predict = function(matrix[double] X,
+                   list[unknown] model)
+    return (matrix[double] probs) {
+
+  W1 = as.matrix(model[1])
+  W2 = as.matrix(model[2])
+  W3 = as.matrix(model[3])
+  b1 = as.matrix(model[4])
+  b2 = as.matrix(model[5])
+  b3 = as.matrix(model[6])
+
+  out1elu = elu::forward(affine::forward(X, W1, b1),1)
+  out2elu = elu::forward(affine::forward(out1elu, W2, b2),1)
+  probs = elu::forward(affine::forward(out2elu, W3, b3),1)
+}
+
+eval = function(matrix[double] probs, matrix[double] y)
+    return (double loss) {
+  loss = logcosh::forward(probs, y)
+}
+
+gradients = function(list[unknown] model,
+                     list[unknown] hyperparams,
+                     matrix[double] features,
+                     matrix[double] labels)
+    return (list[unknown] gradients) {
+
+  W1 = as.matrix(model[1])
+  W2 = as.matrix(model[2])
+  W3 = as.matrix(model[3])
+  b1 = as.matrix(model[4])
+  b2 = as.matrix(model[5])
+  b3 = as.matrix(model[6])
+
+  # Compute forward pass
+  out1 = affine::forward(features, W1, b1)
+  out1elu = elu::forward(out1, 1)
+  out2 = affine::forward(out1elu, W2, b2)
+  out2elu = elu::forward(out2, 1)
+  out3 = affine::forward(out2elu, W3, b3)
+  probs = elu::forward(out3,1)
+
+  # Compute loss & accuracy for training data
+  loss = logcosh::forward(probs, labels)
+  print("Batch loss: " + loss)
+
+  # Compute data backward pass
+  dprobs = logcosh::backward(probs, labels)
+  dout3 = elu::backward(dprobs, out3, 1)
+  [dout2elu, dW3, db3] = affine::backward(dout3, out2elu, W3, b3)
+  dout2 = elu::backward(dout2elu, out2, 1)
+  [dout1elu, dW2, db2] = affine::backward(dout2, out1elu, W2, b2)
+  dout1 = elu::backward(dout1elu, out1, 1)
+  [dfeatures, dW1, db1] = affine::backward(dout1, features, W1, b1)
+
+  gradients = list(dW1, dW2, dW3, db1, db2, db3)
+}
+
+aggregation = function(list[unknown] model,
+                       list[unknown] hyperparams,
+                       list[unknown] gradients)
+    return (list[unknown] model_result) {
+
+  W1 = as.matrix(model[1])
+  W2 = as.matrix(model[2])
+  W3 = as.matrix(model[3])
+  b1 = as.matrix(model[4])
+  b2 = as.matrix(model[5])
+  b3 = as.matrix(model[6])
+  dW1 = as.matrix(gradients[1])
+  dW2 = as.matrix(gradients[2])
+  dW3 = as.matrix(gradients[3])
+  db1 = as.matrix(gradients[4])
+  db2 = as.matrix(gradients[5])
+  db3 = as.matrix(gradients[6])
+  learning_rate = as.double(as.scalar(hyperparams["learning_rate"]))
+
+  # Optimize with SGD
+  W3 = sgd::update(W3, dW3, learning_rate)
+  b3 = sgd::update(b3, db3, learning_rate)
+  W2 = sgd::update(W2, dW2, learning_rate)
+  b2 = sgd::update(b2, db2, learning_rate)
+  W1 = sgd::update(W1, dW1, learning_rate)
+  b1 = sgd::update(b1, db1, learning_rate)
+
+  model_result = list(W1, W2, W3, b1, b2, b3)
+}
+
+
+train = function(matrix[double] X, matrix[double] y,
+                 int epochs, int batch_size, double learning_rate, 
+                 int seed = -1)
+    return (list[unknown] model_trained) {
+
+  N = nrow(X)  # num examples
+  D = ncol(X)  # num features
+  K = ncol(y)  # num classes
+
+  model = init_model(D, K, seed)
+  W1 = as.matrix(model[1])
+  W2 = as.matrix(model[2])
+  W3 = as.matrix(model[3])
+  b1 = as.matrix(model[4])
+  b2 = as.matrix(model[5])
+  b3 = as.matrix(model[6])
+  
+  # Create the hyper parameter list
+  hyperparams = list(learning_rate=learning_rate)
+
+  # Calculate iterations
+  iters = ceil(N / batch_size)
+
+  for (e in 1:epochs) {
+    for(i in 1:iters) {
+      # Create the model list
+      model_list = list(W1, W2, W3, b1, b2, b3)
+
+      # Get next batch
+      beg = ((i-1) * batch_size) %% N + 1
+      end = min(N, beg + batch_size - 1)
+      X_batch = X[beg:end,]
+      y_batch = y[beg:end,]
+
+      gradients_list = gradients(model_list, hyperparams, X_batch, y_batch)
+      model_updated = aggregation(model_list, hyperparams, gradients_list)
+
+      W1 = as.matrix(model_updated[1])
+      W2 = as.matrix(model_updated[2])
+      W3 = as.matrix(model_updated[3])
+      b1 = as.matrix(model_updated[4])
+      b2 = as.matrix(model_updated[5])
+      b3 = as.matrix(model_updated[6])
+
+    }
+  }
+
+  model_trained = list(W1, W2, W3, b1, b2, b3)
+}
+
+train_paramserv = function(matrix[Double] X, matrix[Double] y,
+    Integer epochs, Integer batch_size, Double learning_rate, Integer workers,
+    String utype, String freq, String mode, Integer seed)
+    return (list[unknown] model_trained) {
+
+  N = nrow(X)  # num examples
+  D = ncol(X)  # num features
+  K = ncol(y)  # num classes
+
+  # Create the model list
+  model_list = init_model(D, K, seed)
+
+  # Create the hyper parameter list
+  params = list(learning_rate=learning_rate)
+  
+  # Use paramserv function
+  model_trained = paramserv(model=model_list, features=X, labels=y, 
+    val_features=matrix(0, rows=0, cols=0), val_labels=matrix(0, rows=0, cols=0), 
+    upd="./network/TwoNN.dml::gradients", agg="./network/TwoNN.dml::aggregation",
+    mode=mode, utype=utype, freq=freq, epochs=epochs, batchsize=batch_size,
+    k=workers, hyperparams=params, checkpointing="NONE")
+
+}
+
+save_model = function (list[unknown] model, String baseFolder){
+  W1  = as.matrix(model[1])
+  W2  = as.matrix(model[2])
+  W3  = as.matrix(model[3])
+  b1  = as.matrix(model[4])
+  b2  = as.matrix(model[5])
+  b3  = as.matrix(model[6])
+
+  write(W1, (baseFolder + "/W1.data"), format="binary")
+  write(W2, (baseFolder + "/W2.data"), format="binary")
+  write(W3, (baseFolder + "/W3.data"), format="binary")
+  write(b1, (baseFolder + "/b1.data"), format="binary")
+  write(b2, (baseFolder + "/b2.data") , format="binary")
+  write(b3, (baseFolder + "/b3.data") , format="binary")
+}

Review comment:
       add a newline at the end of the script

##########
File path: src/main/python/tests/manual_tests/multi_log_reg_adult.py
##########
@@ -0,0 +1,27 @@
+# -------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+# -------------------------------------------------------------
+
+from systemds.context import SystemDSContext
+from systemds.operator.algorithm import multiLogReg, multiLogRegPredict
+from systemds.examples.tutorials.adult import DataManager
+
+d = DataManager()
+

Review comment:
       does this file actually test anything?
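       For reference, a minimal manual run assembled from these imports could look
       roughly like the sketch below. This is an illustration only, not the actual
       file contents; it reuses the Level 1 tutorial calls and assumes the
       get_preprocessed_dataset() accessor of the DataManager:

           from systemds.context import SystemDSContext
           from systemds.operator.algorithm import multiLogReg, multiLogRegPredict
           from systemds.examples.tutorials.adult import DataManager

           d = DataManager()
           sds = SystemDSContext()

           # assumed accessor, as used in the tutorial
           train_data, train_labels, test_data, test_labels = d.get_preprocessed_dataset()

           X = sds.from_numpy(train_data)
           Y = sds.from_numpy(train_labels) + 1.0   # multiLogReg expects labels > 0
           Xt = sds.from_numpy(test_data)
           Yt = sds.from_numpy(test_labels) + 1.0

           betas = multiLogReg(X, Y)
           [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt).compute()
           print(acc)

           sds.close()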

##########
File path: src/main/python/tests/examples/tutorials/test_adult.py
##########
@@ -0,0 +1,324 @@
+# -------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+# -------------------------------------------------------------
+import os
+import unittest
+
+import numpy as np
+from systemds.context import SystemDSContext
+from systemds.examples.tutorials.adult import DataManager
+from systemds.operator import OperationNode, Matrix, Frame
+from systemds.operator.algorithm import kmeans, multiLogReg, multiLogRegPredict, l2svm, confusionMatrix, scale, scaleApply, split, winsorize
+from systemds.script_building import DMLScript
+
+
+class Test_DMLScript(unittest.TestCase):
+    """
+    Test class for adult dml script tutorial code.
+    """
+
+    sds: SystemDSContext = None
+    d: DataManager = None
+    neural_net_src_path: str = "tests/examples/tutorials/neural_net_source.dml"
+    preprocess_src_path: str = "tests/examples/tutorials/preprocess.dml"
+    dataset_path_train: str = "../../test/resources/datasets/adult/train_data.csv"
+    dataset_path_train_mtd: str = "../../test/resources/datasets/adult/train_data.csv.mtd"
+    dataset_path_test: str = "../../test/resources/datasets/adult/test_data.csv"
+    dataset_path_test_mtd: str = "../../test/resources/datasets/adult/test_data.csv.mtd"
+    dataset_jspec: str = "../../test/resources/datasets/adult/jspec.json"
+
+    @classmethod
+    def setUpClass(cls):
+        cls.sds = SystemDSContext()
+        cls.d = DataManager()
+
+    @classmethod
+    def tearDownClass(cls):
+        cls.sds.close()
+
+    def test_train_data(self):
+        x = self.d.get_train_data()
+        self.assertEqual((32561, 14), x.shape)
+
+    def test_train_labels(self):
+        y = self.d.get_train_labels()
+        self.assertEqual((32561,), y.shape)
+
+    def test_test_data(self):
+        x_l = self.d.get_test_data()
+        self.assertEqual((16281, 14), x_l.shape)
+
+    def test_test_labels(self):
+        y_l = self.d.get_test_labels()
+        self.assertEqual((16281,), y_l.shape)
+
+    def test_preprocess(self):
+        #assumes certain preprocessing
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset()
+        self.assertEqual((30162,104), train_data.shape)
+        self.assertEqual((30162, ), train_labels.shape)
+        self.assertEqual((15060,104), test_data.shape)
+        self.assertEqual((15060, ), test_labels.shape)
+
+    def test_multi_log_reg(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset()
+
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+        Y = Y + 1.0
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+        Yt = Yt + 1.0
+
+        betas = multiLogReg(X, Y)
+
+        [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt).compute()
+
+        self.assertGreater(acc, 80)
+
+        confusion_matrix_abs, _ = confusionMatrix(self.sds.from_numpy(y_pred), Yt).compute()
+
+        self.assertTrue(
+            np.allclose(
+                confusion_matrix_abs,
+                np.array([[3503, 503],
+                          [268, 726]])
+            )
+        )
+
+    def test_neural_net(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+
+        FFN_package = self.sds.source(self.neural_net_src_path, "fnn", print_imported_methods=True)
+
+        network = FFN_package.train(X, Y, 1, 16, 0.01, 1)
+
+        self.assertTrue(type(network) is not None) # sourcing and training seems to work
+
+        FFN_package.save_model(network, '"model/python_FFN/"').compute(verbose=True)
+
+        # TODO This does not work yet, not sure what the problem is
+        #probs = FFN_package.predict(Xt, network).compute(True)
+        # FFN_package.eval(Yt, Yt).compute()
+
+
+
+    def test_level1(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset(interpolate=True,
+                                                                                           standardize=True, dimred=0.1)
+        # Train data
+        X = self.sds.from_numpy(train_data[:train_count])
+        Y = self.sds.from_numpy(train_labels[:train_count])
+        Y = Y + 1.0
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+        Yt = Yt + 1.0
+
+        betas = multiLogReg(X, Y)
+
+        [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt).compute()
+        self.assertGreater(acc, 80) #Todo remove?
+        # todo add text how high acc should be with this config
+
+        confusion_matrix_abs, _ = confusionMatrix(self.sds.from_numpy(y_pred), Yt).compute()
+        # todo print confusion matrix? Explain cm?
+        self.assertTrue(
+            np.allclose(
+                confusion_matrix_abs,
+                np.array([[3583, 502],
+                          [245, 670]])
+            )
+        )
+
+    def test_level2(self):
+
+        train_count = 32561
+        test_count = 16281
+
+        SCHEMA = '"DOUBLE,STRING,DOUBLE,STRING,DOUBLE,STRING,STRING,STRING,STRING,STRING,DOUBLE,DOUBLE,DOUBLE,STRING,STRING"'
+
+        F1 = self.sds.read(
+            self.dataset_path_train,
+            schema=SCHEMA
+        )
+        F2 = self.sds.read(
+            self.dataset_path_test,
+            schema=SCHEMA
+        )
+
+        jspec = self.sds.read(self.dataset_jspec, data_type="scalar", value_type="string")
+        PREPROCESS_package = self.sds.source(self.preprocess_src_path, "preprocess", print_imported_methods=True)
+
+        X1 = F1.rbind(F2)
+        X1, M1 = X1.transform_encode(spec=jspec)
+
+        X = PREPROCESS_package.get_X(X1, 1, train_count)
+        Y = PREPROCESS_package.get_Y(X1, 1, train_count)
+
+        Xt = PREPROCESS_package.get_X(X1, train_count, train_count+test_count)
+        Yt = PREPROCESS_package.get_Y(X1, train_count, train_count+test_count)
+
+        Yt = PREPROCESS_package.replace_value(Yt, 3.0, 1.0)
+        Yt = PREPROCESS_package.replace_value(Yt, 4.0, 2.0)
+
+        # better alternative for encoding. This was intended, but it does not work
+        #F2 = F2.replace("<=50K.", "<=50K")
+        #F2 = F2.replace(">50K.", ">50K")
+        #X1, M = F1.transform_encode(spec=jspec)
+        #X2 = F2.transform_apply(spec=jspec, meta=M)
+
+        #X = PREPROCESS_package.get_X(X1, 1, train_count)
+        #Y = PREPROCESS_package.get_Y(X1, 1, train_count)
+        #Xt = PREPROCESS_package.get_X(X2, 1, test_count)
+        #Yt = PREPROCESS_package.get_Y(X2, 1, test_count)
+
+        # TODO somehow throws error at predict with this included
+        #X, mean, sigma = scale(X, True, True)
+        #Xt = scaleApply(Xt, mean, sigma)
+
+        betas = multiLogReg(X, Y)
+
+        [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt)
+
+        confusion_matrix_abs, _ = confusionMatrix(y_pred, Yt).compute()
+        print(confusion_matrix_abs)
+        self.assertTrue(
+            np.allclose(
+                confusion_matrix_abs,
+                np.array([[11593.,  1545.],
+                          [842., 2302.]])
+            )
+        )
+
+    def test_level3(self):
+        train_count = 32561
+        test_count = 16281
+
+        SCHEMA = '"DOUBLE,STRING,DOUBLE,STRING,DOUBLE,STRING,STRING,STRING,STRING,STRING,DOUBLE,DOUBLE,DOUBLE,STRING,STRING"'
+
+        F1 = self.sds.read(
+            self.dataset_path_train,
+            schema=SCHEMA
+        )
+        F2 = self.sds.read(
+            self.dataset_path_test,
+            schema=SCHEMA
+        )
+
+        jspec = self.sds.read(self.dataset_jspec, data_type="scalar", value_type="string")
+        PREPROCESS_package = self.sds.source(self.preprocess_src_path, "preprocess", print_imported_methods=True)
+
+        X1 = F1.rbind(F2)
+        X1, M1 = X1.transform_encode(spec=jspec)
+
+        X = PREPROCESS_package.get_X(X1, 1, train_count)
+        Y = PREPROCESS_package.get_Y(X1, 1, train_count)
+
+        Xt = PREPROCESS_package.get_X(X1, train_count, train_count + test_count)
+        Yt = PREPROCESS_package.get_Y(X1, train_count, train_count + test_count)
+
+        Yt = PREPROCESS_package.replace_value(Yt, 3.0, 1.0)
+        Yt = PREPROCESS_package.replace_value(Yt, 4.0, 2.0)
+
+        # better alternative for encoding
+        # F2 = F2.replace("<=50K.", "<=50K")
+        # F2 = F2.replace(">50K.", ">50K")
+        # X1, M = F1.transform_encode(spec=jspec)
+        # X2 = F2.transform_apply(spec=jspec, meta=M)
+
+        # X = PREPROCESS_package.get_X(X1, 1, train_count)
+        # Y = PREPROCESS_package.get_Y(X1, 1, train_count)
+        # Xt = PREPROCESS_package.get_X(X2, 1, test_count)
+        # Yt = PREPROCESS_package.get_Y(X2, 1, test_count)
+
+        # TODO somehow throws error at predict with this included
+        # X, mean, sigma = scale(X, True, True)
+        # Xt = scaleApply(Xt, mean, sigma)
+
+        FFN_package = self.sds.source(self.neural_net_src_path, "fnn", print_imported_methods=True)
+
+        epochs = 1
+        batch_size = 16
+        learning_rate = 0.01
+        seed = 42
+
+        network = FFN_package.train(X, Y, epochs, batch_size, learning_rate, seed)
+
+        """
+        If more resources are available, one can also choose to train the model using a parameter server.

Review comment:
       in general it is actually a bit faster with the parameter server.
   using the parameter server architecture does not necessarily mean multiple machines; it can also run locally with several worker threads
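       To illustrate, a hypothetical local run through the parameter-server entry
       point of the sourced script might look as follows; the argument order is
       taken from the train_paramserv signature in this file, while the utype,
       freq and mode string values are assumptions on my side:

           # epochs=1, batch_size=16, learning_rate=0.01, workers=2
           # '"BSP"', '"BATCH"', '"LOCAL"' are assumed values for utype, freq, mode;
           # LOCAL here means several worker threads on a single machine
           network = FFN_package.train_paramserv(
               X, Y, 1, 16, 0.01, 2, '"BSP"', '"BATCH"', '"LOCAL"', 42)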

##########
File path: src/main/python/docs/source/guide/python_end_to_end_tut.rst
##########
@@ -0,0 +1,561 @@
+.. -------------------------------------------------------------
+.. 
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+.. 
+..   http://www.apache.org/licenses/LICENSE-2.0
+.. 
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+.. 
+.. ------------------------------------------------------------
+
+Python end-to-end tutorial
+==========================
+
+The goal of this tutorial is to showcase different features of the SystemDS framework that can be accessed with the Python API.
+For this, we want to use the `Adult <https://archive.ics.uci.edu/ml/datasets/adult/>`_ dataset and predict whether the income of a person exceeds $50K/yr based on census data.
+The Adult dataset contains attributes like age, workclass, education, marital-status, occupation, race, [...] and the labels >50K or <=50K.
+Most of these features are categorical string values, but the dataset also includes continuous features.
+For this, we define three different levels with an increasing level of detail with regard to features provided by SystemDS.
+In the first level, we simply get an already preprocessed dataset from a DatasetManager.
+The second level shows the built-in preprocessing capabilities of SystemDS.
+With the third level, we want to show how we can integrate custom-built networks or algorithms into our Python program.
+
+Prerequisite: 
+
+- :doc:`/getting_started/install`
+
+Level 1
+-------
+
+This example shows how one can work with NumPy data within the SystemDS framework. More precisely, we will make use of the
+built-in DataManager, Multinomial Logistic Regression function, and the Confusion Matrix function. The dataset used in this
+tutorial is a preprocessed version of the "UCI Adult Data Set". If you are interested in data preprocessing, take a look at level 2.
+If one wants to skip the explanation then the full script is available at the end of this level.
+
+We will train a Multinomial Logistic Regression model on the training dataset and subsequently we will use the test dataset
+to assess how well our model can predict if the income is above or below $50K/yr based on the features.
+
+Step 1: Load and prepare data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+First, we get our training and testing data from the built-in DataManager. Since the multiLogReg function requires the
+labels (Y) to be > 0, we add 1 to all labels. This ensures that the smallest label is >= 1. Additionally we will only take
+a fraction of the training and test set into account to speed up the execution.
+
+.. code-block:: python
+
+    from systemds.context import SystemDSContext
+    from systemds.examples.tutorials.adult import DataManager
+
+    sds = SystemDSContext()
+    d = DataManager()
+
+    # limit the sample size
+    train_count = 15000
+    test_count = 5000
+
+    train_data, train_labels, test_data, test_labels = d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+    # Train data
+    X = sds.from_numpy(train_data[:train_count])
+    Y = sds.from_numpy(train_labels[:train_count])
+    Y = Y + 1.0
+
+    # Test data
+    Xt = sds.from_numpy(test_data[:test_count])
+    Yt = sds.from_numpy(test_labels[:test_count])
+    Yt = Yt + 1.0
+
+Here the DataManager contains the code for downloading and setting up NumPy arrays containing the data.
+It is noteworthy that the function get_preprocessed_dataset has options for basic standardization, interpolation, and for combining categorical values whose occurrences fall below a certain threshold into one column.
+
+Step 2: Training
+~~~~~~~~~~~~~~~~
+
+Now that we prepared the data, we can use the multiLogReg function. First, we will train the model on our
+training data. Afterward, we can make predictions on the test data and assess the performance of the model.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import multiLogReg
+    betas = multiLogReg(X, Y)
+
+Note that nothing has been calculated yet. In SystemDS the calculation is executed once compute() is called.
+E.g. betas_res = betas.compute().
+
+We can now use the trained model to make predictions on the test data.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import multiLogRegPredict
+    [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt)
+
+The multiLogRegPredict function has three return values:
+    - m, a matrix with the mean probability of correctly classifying each label. We do not use it further in this example.
+    - y_pred, the predictions made using the model
+    - acc, the accuracy achieved by the model.
+
+Step 3: Confusion Matrix
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+A confusion matrix is a useful tool to analyze the performance of the model and to obtain a better understanding
+which classes the model has difficulties separating.
+The confusionMatrix function takes the predicted labels and the true labels. It then returns the confusion matrix
+for the predictions and the confusion matrix averages of each true class.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import confusionMatrix
+    confusion_matrix_abs, _ = confusionMatrix(y_pred, Yt).compute()
+    print(confusion_matrix_abs)
+
+Full Script
+~~~~~~~~~~~
+
+In the full script, some steps are combined to keep the overall script short.
+
+.. code-block:: python
+
+    import numpy as np
+    from systemds.context import SystemDSContext
+    from systemds.examples.tutorials.adult import DataManager
+    from systemds.operator.algorithm import multiLogReg, multiLogRegPredict, confusionMatrix
+
+    sds = SystemDSContext()
+    d = DataManager()
+
+    # limit the sample size
+    train_count = 15000
+    test_count = 5000
+
+    train_data, train_labels, test_data, test_labels = d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+    # Train data
+    X = sds.from_numpy(train_data[:train_count])
+    Y = sds.from_numpy(train_labels[:train_count])
+    Y = Y + 1.0
+
+    # Test data
+    Xt = sds.from_numpy(test_data[:test_count])
+    Yt = sds.from_numpy(test_labels[:test_count])
+    Yt = Yt + 1.0
+
+    betas = multiLogReg(X, Y)
+    [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt)
+
+    confusion_matrix_abs, _ = confusionMatrix(y_pred, Yt).compute()
+    print(confusion_matrix_abs)
+
+Level 2
+-------
+
+This part of the tutorial shows an overview of the preprocessing capabilities that SystemDS has to offer.
+We will take an unprocessed dataset in csv format, read it with SystemDS, and let SystemDS do the heavy lifting of the preprocessing.
+As mentioned before, we want to use the Adult dataset for this task.
+If one wants to skip the explanation, then the full script is available at the end of this level.
+
+Step 1: Metadata and reading
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+First of all, we need to download the dataset and create a mtd-file for specifying different metadata about the dataset.
+We download the train and test dataset from: https://archive.ics.uci.edu/ml/datasets/adult
+
+The downloaded dataset will be slightly modified for convenience. These modifications entail removing unnecessary newlines at the end of the files and
+adding column names at the top of the files such that the first line looks like:
+
+.. code-block::
+
+    age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
+
+We also delete the line holding the string value |1x3 Cross validator inside the test dataset.
+
+After these modifications, we have to define a mtd file for each file we want to read. This mtd file has to be in the same directory as the dataset.
+In this particular example, the dataset is split into two files "train_data.csv" and "test_data.csv". We want to read both, which means that we will define a mtd file for
+each of them. Those files will be called "train_data.csv.mtd" and "test_data.csv.mtd".
+In these files, we can define certain properties that the file has and also specify which values are supposed to get treated like missing values.
+
+The content of the train_data.csv.mtd file is:
+
+.. code-block::
+
+    {
+    "data_type": "frame",
+    "format": "csv",
+    "header": true,
+    "naStrings": [ "?", "" ],
+    "rows": 32561,
+    "cols": 15
+    }
+
+The "format" of the file is csv, and "header" is set to true because we added the feature names as headers to the csv files.
+The value "data_type" is set to frame, as the preprocessing functions that we use require this datatype.
+The value of "naStrings" is a list of all the string values that should be treated as unknown values during the preprocessing.
+Also, "rows" in our example is set to 32561, as we have this many entries and "cols" is set to 15 as we have 14 features, and one label column inside the files. We will later show how we can split them.
+
+After these requirements are completed, we have to define a SystemDSContext for reading our dataset. We can do this in the following way:
+
+.. code-block:: python
+
+    sds = SystemDSContext()
+
+    train_count = 32561
+    test_count = 16281
+
+With this context we can now define a read operation using the path of the dataset and a schema.
+The schema simply defines the data types for each column.
+
+As already mentioned, SystemDS supports lazy execution by default, which means that the read operation is only executed after calling the compute() function.
+
+.. code-block:: python
+
+    SCHEMA = '"DOUBLE,STRING,DOUBLE,STRING,DOUBLE,STRING,STRING,STRING,STRING,STRING,DOUBLE,DOUBLE,DOUBLE,STRING,STRING"'
+
+    dataset_path_train = "adult/train_data.csv"
+    dataset_path_test = "adult/test_data.csv"
+
+    F1 = sds.read(
+        dataset_path_train,
+        schema=SCHEMA
+    )
+    F2 = sds.read(
+        dataset_path_test,
+        schema=SCHEMA
+    )
+
+Step 2: Defining preprocess operations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Now that the read operation has been declared, we can define an additional file for the further preprocessing of the dataset.
+For this, we create a .json file that holds information about the operations that will be performed on individual columns.
+For the sake of this tutorial we will use the file "jspec.json" with the following content:
+
+.. code-block::
+
+    {
+    "impute":
+    [ { "name": "age", "method": "global_mean" }
+     ,{ "name": "workclass" , "method": "global_mode" }
+     ,{ "name": "fnlwgt", "method": "global_mean" }
+     ,{ "name": "education", "method": "global_mode"  }
+     ,{ "name": "education-num", "method": "global_mean" }
+     ,{ "name": "marital-status"      , "method": "global_mode" }
+     ,{ "name": "occupation"        , "method": "global_mode" }
+     ,{ "name": "relationship" , "method": "global_mode" }
+     ,{ "name": "race"        , "method": "global_mode" }
+     ,{ "name": "sex"        , "method": "global_mode" }
+     ,{ "name": "capital-gain", "method": "global_mean" }
+     ,{ "name": "capital-loss", "method": "global_mean" }
+     ,{ "name": "hours-per-week", "method": "global_mean" }
+     ,{ "name": "native-country"        , "method": "global_mode" }
+    ],
+    "bin": [ { "name": "age"  , "method": "equi-width", "numbins": 3 }],
+    "dummycode": ["age", "workclass", "education", "marital-status", "occupation", "relationship", "race", "sex", "native-country"],
+    "recode": ["income"]
+    }
+
+Our dataset has missing values. An easy way to deal with that circumstance is to use the "impute" option that SystemDS supports.
+We simply pass a list that holds all the relations between column names and the method of interpolation. A more specific example is the "education" column.
+In the dataset certain entries have missing values for this column. As this is a string feature,
+we can simply define the method as "global_mode" and replace every missing value with the global mode inside this column. It is important to note that
+we first had to define the values of the missing strings in our selected dataset using the .mtd files ("naStrings": [ "?", "" ]).
+
+With the "bin" keyword we can discretize continuous values into a small number of bins. Here the column with age values
+is discretized into three age intervals. The only method that is currently supported is equi-width binning.
+
+The column-level data transformation "dummycode" allows us to one-hot-encode a column.
+In our example we first bin the "age" column into 3 different bins. This means that we now have one column where one entry can belong to one of 3 age groups. After using
+"dummycode", we transform this one column into 3 different columns, one for each bin.
+
+Lastly, we make use of the "recode" transformation for categorical columns; it maps all distinct categories in
+the column into consecutive numbers, starting from 1. In our example we recode the "income" column, which
+transforms it from "<=$50K" and ">$50K" to "1" and "2" respectively.
+
+Another good resource for further ways of processing is: https://apache.github.io/systemds/site/dml-language-reference.html
+
+There we provide different examples for defining jspecs and show what functionality is currently supported.
+
+After defining the .jspec file we can read it by passing the filepath, data_type and value_type using the following command:
+
+.. code-block:: python
+
+    dataset_jspec = "adult/jspec.json"
+    jspec = sds.read(dataset_jspec, data_type="scalar", value_type="string")
+
+Finally, we need to define a custom dml file to split the features from the labels and replace certain values, which we will need later.
+We will call this file "preprocess.dml":
+
+.. code-block::
+
+    get_X = function(matrix[double] X, int start, int stop)
+    return (matrix[double] returnVal) {
+    returnVal = X[start:stop,1:ncol(X)-1]
+    }
+    get_Y = function(matrix[double] X, int start, int stop)
+    return (matrix[double] returnVal) {
+    returnVal = X[start:stop,ncol(X):ncol(X)]
+    }
+    replace_value = function(matrix[double] X, double pattern , double replacement)
+    return (matrix[double] returnVal) {
+    returnVal = replace(target=X, pattern=pattern, replacement=replacement)

Review comment:
       also replace should be callable directly on the matrix/frame in python.
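       A short sketch of that idea, assuming such a replace() is (or becomes)
       available on the Python Frame; it mirrors the commented-out lines in the
       test:

           # normalize the trailing-dot labels of the test frame before encoding,
           # instead of going through the replace_value helper in preprocess.dml
           F2 = F2.replace("<=50K.", "<=50K")
           F2 = F2.replace(">50K.", ">50K")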

##########
File path: src/main/python/docs/source/guide/python_end_to_end_tut.rst
##########
@@ -0,0 +1,561 @@
+.. -------------------------------------------------------------
+.. 
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+.. 
+..   http://www.apache.org/licenses/LICENSE-2.0
+.. 
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+.. 
+.. ------------------------------------------------------------
+
+Python end-to-end tutorial
+==========================
+
+The goal of this tutorial is to showcase different features of the SystemDS framework that can be accessed with the Python API.
+For this, we want to use the `Adult <https://archive.ics.uci.edu/ml/datasets/adult/>`_ dataset and predict whether the income of a person exceeds $50K/yr based on census data.
+The Adult dataset contains attributes like age, workclass, education, marital-status, occupation, race, [...] and the labels >50K or <=50K.
+Most of these features are categorical string values, but the dataset also includes continuous features.
+For this, we define three different levels with an increasing level of detail with regard to features provided by SystemDS.
+In the first level, we simply get an already preprocessed dataset from a DatasetManager.
+The second level shows the built-in preprocessing capabilities of SystemDS.
+With the third level, we want to show how we can integrate custom-built networks or algorithms into our Python program.
+
+Prerequisite: 
+
+- :doc:`/getting_started/install`
+
+Level 1
+-------
+
+This example shows how one can work with NumPy data within the SystemDS framework. More precisely, we will make use of the
+built-in DataManager, Multinomial Logistic Regression function, and the Confusion Matrix function. The dataset used in this
+tutorial is a preprocessed version of the "UCI Adult Data Set". If you are interested in data preprocessing, take a look at level 2.
+If one wants to skip the explanation then the full script is available at the end of this level.
+
+We will train a Multinomial Logistic Regression model on the training dataset and subsequently we will use the test dataset
+to assess how well our model can predict if the income is above or below $50K/yr based on the features.
+
+Step 1: Load and prepare data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+First, we get our training and testing data from the built-in DataManager. Since the multiLogReg function requires the
+labels (Y) to be > 0, we add 1 to all labels. This ensures that the smallest label is >= 1. Additionally we will only take
+a fraction of the training and test set into account to speed up the execution.
+
+.. code-block:: python
+
+    from systemds.context import SystemDSContext
+    from systemds.examples.tutorials.adult import DataManager
+
+    sds = SystemDSContext()
+    d = DataManager()
+
+    # limit the sample size
+    train_count = 15000
+    test_count = 5000
+
+    train_data, train_labels, test_data, test_labels = d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+    # Train data
+    X = sds.from_numpy(train_data[:train_count])
+    Y = sds.from_numpy(train_labels[:train_count])
+    Y = Y + 1.0
+
+    # Test data
+    Xt = sds.from_numpy(test_data[:test_count])
+    Yt = sds.from_numpy(test_labels[:test_count])
+    Yt = Yt + 1.0
+
+Here the DataManager contains the code for downloading and setting up NumPy arrays containing the data.
+It is noteworthy that the function get_preprocessed_dataset has options for basic standardization, interpolation, and for combining categorical values whose occurrences fall below a certain threshold into one column.
+
+Step 2: Training
+~~~~~~~~~~~~~~~~
+
+Now that we prepared the data, we can use the multiLogReg function. First, we will train the model on our
+training data. Afterward, we can make predictions on the test data and assess the performance of the model.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import multiLogReg
+    betas = multiLogReg(X, Y)
+
+Note that nothing has been calculated yet. In SystemDS the calculation is executed once compute() is called.
+E.g. betas_res = betas.compute().
+
+We can now use the trained model to make predictions on the test data.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import multiLogRegPredict
+    [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt)
+
+The multiLogRegPredict function has three return values:
+    - m, a matrix with the mean probability of correctly classifying each label. We do not use it further in this example.
+    - y_pred, the predictions made using the model
+    - acc, the accuracy achieved by the model.
+
+Step 3: Confusion Matrix
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+A confusion matrix is a useful tool to analyze the performance of the model and to obtain a better understanding
+which classes the model has difficulties separating.
+The confusionMatrix function takes the predicted labels and the true labels. It then returns the confusion matrix
+for the predictions and the confusion matrix averages of each true class.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import confusionMatrix
+    confusion_matrix_abs, _ = confusionMatrix(y_pred, Yt).compute()
+    print(confusion_matrix_abs)
+
+Full Script
+~~~~~~~~~~~
+
+In the full script, some steps are combined to keep the overall script short.
+
+.. code-block:: python
+
+    import numpy as np
+    from systemds.context import SystemDSContext
+    from systemds.examples.tutorials.adult import DataManager
+    from systemds.operator.algorithm import multiLogReg, multiLogRegPredict, confusionMatrix
+
+    sds = SystemDSContext()
+    d = DataManager()
+
+    # limit the sample size
+    train_count = 15000
+    test_count = 5000
+
+    train_data, train_labels, test_data, test_labels = d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+    # Train data
+    X = sds.from_numpy(train_data[:train_count])
+    Y = sds.from_numpy(train_labels[:train_count])
+    Y = Y + 1.0
+
+    # Test data
+    Xt = sds.from_numpy(test_data[:test_count])
+    Yt = sds.from_numpy(test_labels[:test_count])
+    Yt = Yt + 1.0
+
+    betas = multiLogReg(X, Y)
+    [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt)
+
+    confusion_matrix_abs, _ = confusionMatrix(y_pred, Yt).compute()
+    print(confusion_matrix_abs)
+
+Level 2
+-------
+
+This part of the tutorial shows an overview of the preprocessing capabilities that SystemDS has to offer.
+We will take an unprocessed dataset in csv format, read it with SystemDS, and let SystemDS do the heavy lifting of the preprocessing.
+As mentioned before, we want to use the Adult dataset for this task.
+If one wants to skip the explanation, then the full script is available at the end of this level.
+
+Step 1: Metadata and reading
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+First of all, we need to download the dataset and create a mtd-file for specifying different metadata about the dataset.
+We download the train and test dataset from: https://archive.ics.uci.edu/ml/datasets/adult
+
+The downloaded dataset will be slightly modified for convenience. These modifications entail removing unnecessary newlines at the end of the files and
+adding column names at the top of the files such that the first line looks like:
+
+.. code-block::
+
+    age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
+
+We also delete the line holding the string value |1x3 Cross validator inside the test dataset.
+
+After these modifications, we have to define a mtd file for each file we want to read. This mtd file has to be in the same directory as the dataset.
+In this particular example, the dataset is split into two files "train_data.csv" and "test_data.csv". We want to read both, which means that we will define a mtd file for
+each of them. Those files will be called "train_data.csv.mtd" and "test_data.csv.mtd".
+In these files, we can define certain properties that the file has and also specify which values are supposed to get treated like missing values.
+
+The content of the train_data.csv.mtd file is:
+
+.. code-block::
+
+    {
+    "data_type": "frame",
+    "format": "csv",
+    "header": true,
+    "naStrings": [ "?", "" ],
+    "rows": 32561,
+    "cols": 15
+    }
+
+The "format" of the file is csv, and "header" is set to true because we added the feature names as headers to the csv files.
+The value "data_type" is set to frame, as the preprocessing functions that we use require this datatype.
+The value of "naStrings" is a list of all the string values that should be treated as unknown values during the preprocessing.
+Also, "rows" in our example is set to 32561, as we have this many entries and "cols" is set to 15 as we have 14 features, and one label column inside the files. We will later show how we can split them.
+
+After these requirements are completed, we have to define a SystemDSContext for reading our dataset. We can do this in the following way:
+
+.. code-block:: python
+
+    sds = SystemDSContext()
+
+    train_count = 32561
+    test_count = 16281
+
+With this context we can now define a read operation using the path of the dataset and a schema.
+The schema simply defines the data types for each column.
+
+As already mentioned, SystemDS supports lazy execution by default, which means that the read operation is only executed after calling the compute() function.
+
+.. code-block:: python
+
+    SCHEMA = '"DOUBLE,STRING,DOUBLE,STRING,DOUBLE,STRING,STRING,STRING,STRING,STRING,DOUBLE,DOUBLE,DOUBLE,STRING,STRING"'
+
+    dataset_path_train = "adult/train_data.csv"
+    dataset_path_test = "adult/test_data.csv"
+
+    F1 = sds.read(
+        dataset_path_train,
+        schema=SCHEMA
+    )
+    F2 = sds.read(
+        dataset_path_test,
+        schema=SCHEMA
+    )
+
+Step 2: Defining preprocess operations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Now that the read operation has been declared, we can define an additional file for the further preprocessing of the dataset.
+For this, we create a .json file that holds information about the operations that will be performed on individual columns.
+For the sake of this tutorial we will use the file "jspec.json" with the following content:
+
+.. code-block::
+
+    {
+    "impute":
+    [ { "name": "age", "method": "global_mean" }
+     ,{ "name": "workclass" , "method": "global_mode" }
+     ,{ "name": "fnlwgt", "method": "global_mean" }
+     ,{ "name": "education", "method": "global_mode"  }
+     ,{ "name": "education-num", "method": "global_mean" }
+     ,{ "name": "marital-status"      , "method": "global_mode" }
+     ,{ "name": "occupation"        , "method": "global_mode" }
+     ,{ "name": "relationship" , "method": "global_mode" }
+     ,{ "name": "race"        , "method": "global_mode" }
+     ,{ "name": "sex"        , "method": "global_mode" }
+     ,{ "name": "capital-gain", "method": "global_mean" }
+     ,{ "name": "capital-loss", "method": "global_mean" }
+     ,{ "name": "hours-per-week", "method": "global_mean" }
+     ,{ "name": "native-country"        , "method": "global_mode" }
+    ],
+    "bin": [ { "name": "age"  , "method": "equi-width", "numbins": 3 }],
+    "dummycode": ["age", "workclass", "education", "marital-status", "occupation", "relationship", "race", "sex", "native-country"],
+    "recode": ["income"]
+    }
+
+Our dataset has missing values. An easy way to deal with that circumstance is to use the "impute" option that SystemDS supports.
+We simply pass a list that holds all the relations between column names and the method of interpolation. A more specific example is the "education" column.
+In the dataset certain entries have missing values for this column. As this is a string feature,
+we can simply define the method as "global_mode" and replace every missing value with the global mode inside this column. It is important to note that
+we first had to define the values of the missing strings in our selected dataset using the .mtd files ("naStrings": [ "?", "" ]).
+
+With the "bin" keyword we can discretize continuous values into a small number of bins. Here the column with age values
+is discretized into three age intervals. The only method that is currently supported is equi-width binning.
+
+The column-level data transformation "dummycode" allows us to one-hot-encode a column.
+In our example we first bin the "age" column into 3 different bins. This means that we now have one column where one entry can belong to one of 3 age groups. After using
+"dummycode", we transform this one column into 3 different columns, one for each bin.
+
+Lastly, we make use of the "recode" transformation for categorical columns; it maps all distinct categories in
+the column into consecutive numbers, starting from 1. In our example we recode the "income" column, which
+transforms it from "<=$50K" and ">$50K" to "1" and "2" respectively.
+
+Another good resource for further ways of processing is: https://apache.github.io/systemds/site/dml-language-reference.html
+
+There we provide different examples for defining jspecs and show what functionality is currently supported.
+
+After defining the .jspec file we can read it by passing the filepath, data_type and value_type using the following command:
+
+.. code-block:: python
+
+    dataset_jspec = "adult/jspec.json"
+    jspec = sds.read(dataset_jspec, data_type="scalar", value_type="string")
+
+Finally, we need to define a custom dml file to split the features from the labels and replace certain values, which we will need later.
+We will call this file "preprocess.dml":
+
+.. code-block::
+
+    get_X = function(matrix[double] X, int start, int stop)
+    return (matrix[double] returnVal) {
+    returnVal = X[start:stop,1:ncol(X)-1]
+    }
+    get_Y = function(matrix[double] X, int start, int stop)
+    return (matrix[double] returnVal) {
+    returnVal = X[start:stop,ncol(X):ncol(X)]
+    }
+    replace_value = function(matrix[double] X, double pattern , double replacement)
+    return (matrix[double] returnVal) {
+    returnVal = replace(target=X, pattern=pattern, replacement=replacement)
+    }
+
+The get_X function simply extracts every column except the last one and can also be used to pick certain slices from the dataset.
+The get_Y function only extracts the last column, which in our case holds the labels. Replace_value is used to replace a double value with another double.
+The preprocess.dml file can be read with the following command:
+
+.. code-block:: python
+
+    preprocess_src_path = "preprocess.dml"
+    PREPROCESS_package = sds.source(preprocess_src_path, "preprocess", print_imported_methods=True)
+
+The print_imported_methods flag can be used to verify whether every method has been parsed correctly.
+
+Step 3: Applying the preprocessing steps
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Generally speaking, we would use the transform_encode function on the train dataset and with the returned encoding call the transform_apply function on the test dataset.
+In the case of the Adult dataset, we have inconsistent label names inside the test dataset and the train dataset, which is why we will show how we can deal with that using SystemDS.
+First of all, we combine the train and the test dataset by using the rbind() function. This function simply appends the Frame F2 at the end of Frame F1.
+This is necessary to ensure that the encoding is identical between train and test dataset.
+
+.. code-block:: python
+
+    X1 = F1.rbind(F2)
+
+In order to use our jspec file we can apply the transform_encode() function. We simply have to pass the read .json file from before.
+In our particular case we obtain the Matrix X1 and the Frame M1 from the operation. X1 holds all the encoded values and M1 holds a mapping between the encoded values
+and all the initial values. Columns that have not been specified in the .json file were not altered.
+
+.. code-block:: python
+
+    X1, M1 = X1.transform_encode(spec=jspec)
+
+We now can use the previously parsed dml file for splitting the dataset and unifying the inconsistent labels. It is noteworthy that the
+file is parsed such that we can directly call the function names from the Python API.
+
+.. code-block:: python
+
+    X = PREPROCESS_package.get_X(X1, 1, train_count)
+    Y = PREPROCESS_package.get_Y(X1, 1, train_count)
+
+    Xt = PREPROCESS_package.get_X(X1, train_count, train_count+test_count)
+    Yt = PREPROCESS_package.get_Y(X1, train_count, train_count+test_count)
+
+    Yt = PREPROCESS_package.replace_value(Yt, 3.0, 1.0)
+    Yt = PREPROCESS_package.replace_value(Yt, 4.0, 2.0)
+
+Step 4: Training and confusion matrix
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Now that we prepared the data we can use the multiLogReg function.
+These steps are identical to step 2 and 3 that have already been described in level 1 of this tutorial.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import multiLogReg
+    from systemds.operator.algorithm import confusionMatrix
+    from systemds.operator.algorithm import multiLogRegPredict
+    betas = multiLogReg(X, Y)
+    [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt)
+    confusion_matrix_abs, _ = confusionMatrix(y_pred, Yt).compute()
+    print(confusion_matrix_abs)
+
+Full Script
+~~~~~~~~~~~
+
+The complete script now can be seen here:
+
+.. code-block:: python
+
+    import numpy as np
+    from systemds.context import SystemDSContext
+    from systemds.operator.algorithm import multiLogReg, multiLogRegPredict, confusionMatrix
+
+    train_count = 32561
+    test_count = 16281
+
+    dataset_path_train = "adult/train_data.csv"
+    dataset_path_test = "adult/test_data.csv"
+    dataset_jspec = "adult/jspec.json"
+    preprocess_src_path = "preprocess.dml"
+
+    sds = SystemDSContext()
+
+    SCHEMA = '"DOUBLE,STRING,DOUBLE,STRING,DOUBLE,STRING,STRING,STRING,STRING,STRING,DOUBLE,DOUBLE,DOUBLE,STRING,STRING"'
+
+    F1 = sds.read(dataset_path_train, schema=SCHEMA)
+    F2 = sds.read(dataset_path_test,  schema=SCHEMA)
+
+    jspec = sds.read(dataset_jspec, data_type="scalar", value_type="string")
+    PREPROCESS_package = sds.source(preprocess_src_path, "preprocess", print_imported_methods=True)
+
+    X1 = F1.rbind(F2)
+    X1, M1 = X1.transform_encode(spec=jspec)
+
+    X = PREPROCESS_package.get_X(X1, 1, train_count)
+    Y = PREPROCESS_package.get_Y(X1, 1, train_count)
+
+    Xt = PREPROCESS_package.get_X(X1, train_count, train_count+test_count)
+    Yt = PREPROCESS_package.get_Y(X1, train_count, train_count+test_count)

Review comment:
       then we would not need to use source() here in level 2, which would make it cleaner. I now understand why you commented on this in the emails
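       For illustration, a minimal sketch of that alternative, taken from the
       commented-out code in the corresponding test; it assumes the label mismatch
       between the train and test file has been fixed beforehand:

           # encode on the training frame only and re-apply the returned metadata
           # to the test frame; the label column still has to be split off afterwards
           X1, M = F1.transform_encode(spec=jspec)
           X2 = F2.transform_apply(spec=jspec, meta=M)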

##########
File path: src/main/python/tests/examples/tutorials/neural_net_source.dml
##########
@@ -0,0 +1,217 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+# Imports
+source("nn/layers/affine.dml") as affine
+source("nn/layers/logcosh_loss.dml") as logcosh
+source("nn/layers/elu.dml") as elu
+source("nn/layers/sigmoid.dml") as sigmoid
+source("nn/optim/sgd.dml") as sgd
+
+init_model = function(Integer inputDimension, Integer outputDimension, int seed = -1)
+  return(list[unknown] model){
+  [W1, b1] = affine::init(inputDimension, 200, seed = seed)
+  lseed = ifelse(seed==-1, -1, seed + 1);
+  [W2, b2] = affine::init(200, 200,  seed = lseed)
+  lseed = ifelse(seed==-1, -1, seed + 2);
+  [W3, b3] = affine::init(200, outputDimension, seed = lseed)
+  model = list(W1, W2, W3, b1, b2, b3)
+}
+
+
+predict = function(matrix[double] X,
+                   list[unknown] model)
+    return (matrix[double] probs) {
+
+  W1 = as.matrix(model[1])
+  W2 = as.matrix(model[2])
+  W3 = as.matrix(model[3])
+  b1 = as.matrix(model[4])
+  b2 = as.matrix(model[5])
+  b3 = as.matrix(model[6])
+
+  out1elu = elu::forward(affine::forward(X, W1, b1),1)
+  out2elu = elu::forward(affine::forward(out1elu, W2, b2),1)
+  probs = elu::forward(affine::forward(out2elu, W3, b3),1)
+}
+
+eval = function(matrix[double] probs, matrix[double] y)
+    return (double loss) {
+  loss = logcosh::forward(probs, y)
+}
+
+gradients = function(list[unknown] model,
+                     list[unknown] hyperparams,
+                     matrix[double] features,
+                     matrix[double] labels)
+    return (list[unknown] gradients) {
+
+  W1 = as.matrix(model[1])
+  W2 = as.matrix(model[2])
+  W3 = as.matrix(model[3])
+  b1 = as.matrix(model[4])
+  b2 = as.matrix(model[5])
+  b3 = as.matrix(model[6])
+
+  # Compute forward pass
+  out1 = affine::forward(features, W1, b1)
+  out1elu = elu::forward(out1, 1)
+  out2 = affine::forward(out1elu, W2, b2)
+  out2elu = elu::forward(out2, 1)
+  out3 = affine::forward(out2elu, W3, b3)
+  probs = elu::forward(out3,1)
+
+  # Compute loss & accuracy for training data
+  loss = logcosh::forward(probs, labels)
+  print("Batch loss: " + loss)
+
+  # Compute data backward pass
+  dprobs = logcosh::backward(probs, labels)
+  dout3 = elu::backward(dprobs, out3, 1)
+  [dout2elu, dW3, db3] = affine::backward(dout3, out2elu, W3, b3)
+  dout2 = elu::backward(dout2elu, out2, 1)
+  [dout1elu, dW2, db2] = affine::backward(dout2, out1elu, W2, b2)
+  dout1 = elu::backward(dout1elu, out1, 1)
+  [dfeatures, dW1, db1] = affine::backward(dout1, features, W1, b1)
+
+  gradients = list(dW1, dW2, dW3, db1, db2, db3)
+}
+
+aggregation = function(list[unknown] model,
+                       list[unknown] hyperparams,
+                       list[unknown] gradients)
+    return (list[unknown] model_result) {
+
+  W1 = as.matrix(model[1])
+  W2 = as.matrix(model[2])
+  W3 = as.matrix(model[3])
+  b1 = as.matrix(model[4])
+  b2 = as.matrix(model[5])
+  b3 = as.matrix(model[6])
+  dW1 = as.matrix(gradients[1])
+  dW2 = as.matrix(gradients[2])
+  dW3 = as.matrix(gradients[3])
+  db1 = as.matrix(gradients[4])
+  db2 = as.matrix(gradients[5])
+  db3 = as.matrix(gradients[6])
+  learning_rate = as.double(as.scalar(hyperparams["learning_rate"]))
+
+  # Optimize with SGD
+  W3 = sgd::update(W3, dW3, learning_rate)
+  b3 = sgd::update(b3, db3, learning_rate)
+  W2 = sgd::update(W2, dW2, learning_rate)
+  b2 = sgd::update(b2, db2, learning_rate)
+  W1 = sgd::update(W1, dW1, learning_rate)
+  b1 = sgd::update(b1, db1, learning_rate)
+
+  model_result = list(W1, W2, W3, b1, b2, b3)
+}
+
+
+train = function(matrix[double] X, matrix[double] y,
+                 int epochs, int batch_size, double learning_rate, 
+                 int seed = -1)
+    return (list[unknown] model_trained) {
+
+  N = nrow(X)  # num examples
+  D = ncol(X)  # num features
+  K = ncol(y)  # num classes
+
+  model = init_model(D, K, seed)
+  W1 = as.matrix(model[1])
+  W2 = as.matrix(model[2])
+  W3 = as.matrix(model[3])
+  b1 = as.matrix(model[4])
+  b2 = as.matrix(model[5])
+  b3 = as.matrix(model[6])
+  
+  # Create the hyper parameter list
+  hyperparams = list(learning_rate=learning_rate)
+
+  # Calculate iterations
+  iters = ceil(N / batch_size)
+
+  for (e in 1:epochs) {
+    for(i in 1:iters) {
+      # Create the model list
+      model_list = list(W1, W2, W3, b1, b2, b3)
+
+      # Get next batch
+      beg = ((i-1) * batch_size) %% N + 1
+      end = min(N, beg + batch_size - 1)
+      X_batch = X[beg:end,]
+      y_batch = y[beg:end,]
+
+      gradients_list = gradients(model_list, hyperparams, X_batch, y_batch)
+      model_updated = aggregation(model_list, hyperparams, gradients_list)
+
+      W1 = as.matrix(model_updated[1])
+      W2 = as.matrix(model_updated[2])
+      W3 = as.matrix(model_updated[3])
+      b1 = as.matrix(model_updated[4])
+      b2 = as.matrix(model_updated[5])
+      b3 = as.matrix(model_updated[6])
+
+    }
+  }
+
+  model_trained = list(W1, W2, W3, b1, b2, b3)
+}
+
+train_paramserv = function(matrix[Double] X, matrix[Double] y,

Review comment:
       maybe remove the paramserv

##########
File path: src/main/python/tests/examples/tutorials/test_adult.py
##########
@@ -0,0 +1,324 @@
+# -------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+# -------------------------------------------------------------
+import os
+import unittest
+
+import numpy as np
+from systemds.context import SystemDSContext
+from systemds.examples.tutorials.adult import DataManager
+from systemds.operator import OperationNode, Matrix, Frame
+from systemds.operator.algorithm import kmeans, multiLogReg, multiLogRegPredict, l2svm, confusionMatrix, scale, scaleApply, split, winsorize
+from systemds.script_building import DMLScript
+
+
+class Test_DMLScript(unittest.TestCase):
+    """
+    Test class for adult dml script tutorial code.
+    """
+
+    sds: SystemDSContext = None
+    d: DataManager = None
+    neural_net_src_path: str = "tests/examples/tutorials/neural_net_source.dml"
+    preprocess_src_path: str = "tests/examples/tutorials/preprocess.dml"
+    dataset_path_train: str = "../../test/resources/datasets/adult/train_data.csv"
+    dataset_path_train_mtd: str = "../../test/resources/datasets/adult/train_data.csv.mtd"
+    dataset_path_test: str = "../../test/resources/datasets/adult/test_data.csv"
+    dataset_path_test_mtd: str = "../../test/resources/datasets/adult/test_data.csv.mtd"
+    dataset_jspec: str = "../../test/resources/datasets/adult/jspec.json"
+
+    @classmethod
+    def setUpClass(cls):
+        cls.sds = SystemDSContext()
+        cls.d = DataManager()
+
+    @classmethod
+    def tearDownClass(cls):
+        cls.sds.close()
+
+    def test_train_data(self):
+        x = self.d.get_train_data()
+        self.assertEqual((32561, 14), x.shape)
+
+    def test_train_labels(self):
+        y = self.d.get_train_labels()
+        self.assertEqual((32561,), y.shape)
+
+    def test_test_data(self):
+        x_l = self.d.get_test_data()
+        self.assertEqual((16281, 14), x_l.shape)
+
+    def test_test_labels(self):
+        y_l = self.d.get_test_labels()
+        self.assertEqual((16281,), y_l.shape)
+
+    def test_preprocess(self):
+        #assumes certain preprocessing
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset()
+        self.assertEqual((30162,104), train_data.shape)
+        self.assertEqual((30162, ), train_labels.shape)
+        self.assertEqual((15060,104), test_data.shape)
+        self.assertEqual((15060, ), test_labels.shape)
+
+    def test_multi_log_reg(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset()
+
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+        Y = Y + 1.0
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+        Yt = Yt + 1.0
+
+        betas = multiLogReg(X, Y)
+
+        [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt).compute()
+
+        self.assertGreater(acc, 80)
+
+        confusion_matrix_abs, _ = confusionMatrix(self.sds.from_numpy(y_pred), Yt).compute()
+
+        self.assertTrue(
+            np.allclose(
+                confusion_matrix_abs,
+                np.array([[3503, 503],
+                          [268, 726]])
+            )
+        )
+
+    def test_neural_net(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+
+        FFN_package = self.sds.source(self.neural_net_src_path, "fnn", print_imported_methods=True)
+
+        network = FFN_package.train(X, Y, 1, 16, 0.01, 1)
+
+        self.assertTrue(network is not None) # sourcing and training seems to work
+
+        FFN_package.save_model(network, '"model/python_FFN/"').compute(verbose=True)
+
+        # TODO This does not work yet, not sure what the problem is
+        #probs = FFN_package.predict(Xt, network).compute(True)
+        # FFN_package.eval(Yt, Yt).compute()
+
+
+
+    def test_level1(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset(interpolate=True,
+                                                                                           standardize=True, dimred=0.1)
+        # Train data
+        X = self.sds.from_numpy(train_data[:train_count])
+        Y = self.sds.from_numpy(train_labels[:train_count])
+        Y = Y + 1.0
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+        Yt = Yt + 1.0
+
+        betas = multiLogReg(X, Y)
+
+        [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt).compute()
+        self.assertGreater(acc, 80) #Todo remove?
+        # todo add text how high acc should be with this config
+
+        confusion_matrix_abs, _ = confusionMatrix(self.sds.from_numpy(y_pred), Yt).compute()
+        # todo print confusion matrix? Explain cm?
+        self.assertTrue(
+            np.allclose(
+                confusion_matrix_abs,
+                np.array([[3583, 502],
+                          [245, 670]])
+            )
+        )
+
+    def test_level2(self):
+
+        train_count = 32561
+        test_count = 16281
+
+        SCHEMA = '"DOUBLE,STRING,DOUBLE,STRING,DOUBLE,STRING,STRING,STRING,STRING,STRING,DOUBLE,DOUBLE,DOUBLE,STRING,STRING"'
+
+        F1 = self.sds.read(
+            self.dataset_path_train,
+            schema=SCHEMA
+        )
+        F2 = self.sds.read(
+            self.dataset_path_test,
+            schema=SCHEMA
+        )
+
+        jspec = self.sds.read(self.dataset_jspec, data_type="scalar", value_type="string")
+        PREPROCESS_package = self.sds.source(self.preprocess_src_path, "preprocess", print_imported_methods=True)
+
+        X1 = F1.rbind(F2)
+        X1, M1 = X1.transform_encode(spec=jspec)
+
+        X = PREPROCESS_package.get_X(X1, 1, train_count)
+        Y = PREPROCESS_package.get_Y(X1, 1, train_count)
+
+        Xt = PREPROCESS_package.get_X(X1, train_count, train_count+test_count)
+        Yt = PREPROCESS_package.get_Y(X1, train_count, train_count+test_count)
+
+        Yt = PREPROCESS_package.replace_value(Yt, 3.0, 1.0)
+        Yt = PREPROCESS_package.replace_value(Yt, 4.0, 2.0)
+
+        # better alternative for encoding. This was intended, but it does not work
+        #F2 = F2.replace("<=50K.", "<=50K")
+        #F2 = F2.replace(">50K.", ">50K")

Review comment:
       remove the commented code unless it has to do with the comment; if it does, then clearly mark that the comment is associated with this code

##########
File path: src/main/python/tests/examples/tutorials/neural_net_source.dml
##########
@@ -0,0 +1,217 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+# Imports
+source("nn/layers/affine.dml") as affine
+source("nn/layers/logcosh_loss.dml") as logcosh
+source("nn/layers/elu.dml") as elu
+source("nn/layers/sigmoid.dml") as sigmoid
+source("nn/optim/sgd.dml") as sgd
+
+init_model = function(Integer inputDimension, Integer outputDimension, int seed = -1)
+  return(list[unknown] model){
+  [W1, b1] = affine::init(inputDimension, 200, seed = seed)
+  lseed = ifelse(seed==-1, -1, seed + 1);
+  [W2, b2] = affine::init(200, 200,  seed = lseed)
+  lseed = ifelse(seed==-1, -1, seed + 2);
+  [W3, b3] = affine::init(200, outputDimension, seed = lseed)
+  model = list(W1, W2, W3, b1, b2, b3)
+}
+
+
+predict = function(matrix[double] X,
+                   list[unknown] model)
+    return (matrix[double] probs) {
+
+  W1 = as.matrix(model[1])
+  W2 = as.matrix(model[2])
+  W3 = as.matrix(model[3])
+  b1 = as.matrix(model[4])
+  b2 = as.matrix(model[5])
+  b3 = as.matrix(model[6])
+
+  out1elu = elu::forward(affine::forward(X, W1, b1),1)
+  out2elu = elu::forward(affine::forward(out1elu, W2, b2),1)
+  probs = elu::forward(affine::forward(out2elu, W3, b3),1)
+}
+
+eval = function(matrix[double] probs, matrix[double] y)
+    return (double loss) {
+  loss = logcosh::forward(probs, y)
+}
+
+gradients = function(list[unknown] model,
+                     list[unknown] hyperparams,
+                     matrix[double] features,
+                     matrix[double] labels)
+    return (list[unknown] gradients) {
+
+  W1 = as.matrix(model[1])
+  W2 = as.matrix(model[2])
+  W3 = as.matrix(model[3])
+  b1 = as.matrix(model[4])
+  b2 = as.matrix(model[5])
+  b3 = as.matrix(model[6])
+
+  # Compute forward pass
+  out1 = affine::forward(features, W1, b1)
+  out1elu = elu::forward(out1, 1)
+  out2 = affine::forward(out1elu, W2, b2)
+  out2elu = elu::forward(out2, 1)
+  out3 = affine::forward(out2elu, W3, b3)
+  probs = elu::forward(out3,1)
+
+  # Compute loss & accuracy for training data
+  loss = logcosh::forward(probs, labels)
+  print("Batch loss: " + loss)
+
+  # Compute data backward pass
+  dprobs = logcosh::backward(probs, labels)
+  dout3 = elu::backward(dprobs, out3, 1)
+  [dout2elu, dW3, db3] = affine::backward(dout3, out2elu, W3, b3)
+  dout2 = elu::backward(dout2elu, out2, 1)
+  [dout1elu, dW2, db2] = affine::backward(dout2, out1elu, W2, b2)
+  dout1 = elu::backward(dout1elu, out1, 1)
+  [dfeatures, dW1, db1] = affine::backward(dout1, features, W1, b1)
+
+  gradients = list(dW1, dW2, dW3, db1, db2, db3)
+}
+
+aggregation = function(list[unknown] model,
+                       list[unknown] hyperparams,
+                       list[unknown] gradients)
+    return (list[unknown] model_result) {
+
+  W1 = as.matrix(model[1])
+  W2 = as.matrix(model[2])
+  W3 = as.matrix(model[3])
+  b1 = as.matrix(model[4])
+  b2 = as.matrix(model[5])
+  b3 = as.matrix(model[6])
+  dW1 = as.matrix(gradients[1])
+  dW2 = as.matrix(gradients[2])
+  dW3 = as.matrix(gradients[3])
+  db1 = as.matrix(gradients[4])
+  db2 = as.matrix(gradients[5])
+  db3 = as.matrix(gradients[6])
+  learning_rate = as.double(as.scalar(hyperparams["learning_rate"]))
+
+  # Optimize with SGD
+  W3 = sgd::update(W3, dW3, learning_rate)
+  b3 = sgd::update(b3, db3, learning_rate)
+  W2 = sgd::update(W2, dW2, learning_rate)
+  b2 = sgd::update(b2, db2, learning_rate)
+  W1 = sgd::update(W1, dW1, learning_rate)
+  b1 = sgd::update(b1, db1, learning_rate)
+
+  model_result = list(W1, W2, W3, b1, b2, b3)
+}
+
+
+train = function(matrix[double] X, matrix[double] y,
+                 int epochs, int batch_size, double learning_rate, 
+                 int seed = -1)
+    return (list[unknown] model_trained) {
+
+  N = nrow(X)  # num examples
+  D = ncol(X)  # num features
+  K = ncol(y)  # num classes
+
+  model = init_model(D, K, seed)
+  W1 = as.matrix(model[1])
+  W2 = as.matrix(model[2])
+  W3 = as.matrix(model[3])
+  b1 = as.matrix(model[4])
+  b2 = as.matrix(model[5])
+  b3 = as.matrix(model[6])
+  
+  # Create the hyper parameter list
+  hyperparams = list(learning_rate=learning_rate)
+
+  # Calculate iterations
+  iters = ceil(N / batch_size)
+
+  for (e in 1:epochs) {
+    for(i in 1:iters) {
+      # Create the model list
+      model_list = list(W1, W2, W3, b1, b2, b3)
+
+      # Get next batch
+      beg = ((i-1) * batch_size) %% N + 1
+      end = min(N, beg + batch_size - 1)
+      X_batch = X[beg:end,]
+      y_batch = y[beg:end,]
+
+      gradients_list = gradients(model_list, hyperparams, X_batch, y_batch)
+      model_updated = aggregation(model_list, hyperparams, gradients_list)
+
+      W1 = as.matrix(model_updated[1])
+      W2 = as.matrix(model_updated[2])
+      W3 = as.matrix(model_updated[3])
+      b1 = as.matrix(model_updated[4])
+      b2 = as.matrix(model_updated[5])
+      b3 = as.matrix(model_updated[6])
+
+    }
+  }
+
+  model_trained = list(W1, W2, W3, b1, b2, b3)
+}
+
+train_paramserv = function(matrix[Double] X, matrix[Double] y,
+    Integer epochs, Integer batch_size, Double learning_rate, Integer workers,
+    String utype, String freq, String mode, Integer seed)
+    return (list[unknown] model_trained) {
+
+  N = nrow(X)  # num examples
+  D = ncol(X)  # num features
+  K = ncol(y)  # num classes
+
+  # Create the model list
+  model_list = init_model(D, K, seed)
+
+  # Create the hyper parameter list
+  params = list(learning_rate=learning_rate)
+  
+  # Use paramserv function
+  model_trained = paramserv(model=model_list, features=X, labels=y, 
+    val_features=matrix(0, rows=0, cols=0), val_labels=matrix(0, rows=0, cols=0), 
+    upd="./network/TwoNN.dml::gradients", agg="./network/TwoNN.dml::aggregation",
+    mode=mode, utype=utype, freq=freq, epochs=epochs, batchsize=batch_size,
+    k=workers, hyperparams=params, checkpointing="NONE")
+
+}
+
+save_model = function (list[unknown] model, String baseFolder){

Review comment:
       remove the save model

##########
File path: src/main/python/docs/source/guide/python_end_to_end_tut.rst
##########
@@ -0,0 +1,561 @@
+.. -------------------------------------------------------------
+.. 
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+.. 
+..   http://www.apache.org/licenses/LICENSE-2.0
+.. 
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+.. 
+.. ------------------------------------------------------------
+
+Python end-to-end tutorial
+==========================
+
+The goal of this tutorial is to showcase different features of the SystemDS framework that can be accessed with the Python API.
+For this, we want to use the `Adult <https://archive.ics.uci.edu/ml/datasets/adult/>`_ dataset and predict whether the income of a person exceeds $50K/yr based on census data.
+The Adult dataset contains attributes like age, workclass, education, marital-status, occupation, race, [...] and the labels >50K or <=50K.
+Most of these features are categorical string values, but the dataset also includes continuous features.
+For this, we define three different levels with an increasing level of detail with regard to the features provided by SystemDS.
+In the first level, we simply get an already preprocessed dataset from a DataManager.
+The second level shows the built-in preprocessing capabilities of SystemDS.
+With the third level, we want to show how we can integrate custom-built networks or algorithms into our Python program.
+
+Prerequisite: 
+
+- :doc:`/getting_started/install`
+
+Level 1
+-------
+
+This example shows how one can work with NumPy data within the SystemDS framework. More precisely, we will make use of the
+built-in DataManager, Multinomial Logistic Regression function, and the Confusion Matrix function. The dataset used in this
+tutorial is a preprocessed version of the "UCI Adult Data Set". If you are interested in data preprocessing, take a look at level 2.
+If one wants to skip the explanation, the full script is available at the end of this level.
+
+We will train a Multinomial Logistic Regression model on the training dataset and subsequently we will use the test dataset
+to assess how well our model can predict if the income is above or below $50K/yr based on the features.
+
+Step 1: Load and prepare data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+First, we get our training and testing data from the built-in DataManager. Since the multiLogReg function requires the
+labels (Y) to be > 0, we add 1 to all labels. This ensures that the smallest label is >= 1. Additionally, we only take
+a fraction of the training and test set into account to speed up the execution.
+
+.. code-block:: python
+
+    from systemds.context import SystemDSContext
+    from systemds.examples.tutorials.adult import DataManager
+
+    sds = SystemDSContext()
+    d = DataManager()
+
+    # limit the sample size
+    train_count = 15000
+    test_count = 5000
+
+    train_data, train_labels, test_data, test_labels = d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+    # Train data
+    X = sds.from_numpy(train_data[:train_count])
+    Y = sds.from_numpy(train_labels[:train_count])
+    Y = Y + 1.0
+
+    # Test data
+    Xt = sds.from_numpy(test_data[:test_count])
+    Yt = sds.from_numpy(test_labels[:test_count])
+    Yt = Yt + 1.0
+
+Here the DataManager contains the code for downloading the dataset and setting up the NumPy arrays containing the data.
+It is noteworthy that the function get_preprocessed_dataset has options for basic standardization, interpolation, and for combining infrequent categorical values of a column (those whose occurrences fall below a given threshold) into a single category.
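+
+A minimal sketch of toggling these options (only arguments that already appear in this tutorial are used):
+
+.. code-block:: python
+
+    # Plain call, relying on the function's default settings
+    train_data, train_labels, test_data, test_labels = d.get_preprocessed_dataset()
+
+    # The configuration used throughout this tutorial
+    train_data, train_labels, test_data, test_labels = d.get_preprocessed_dataset(
+        interpolate=True, standardize=True, dimred=0.1)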
+
+Step 2: Training
+~~~~~~~~~~~~~~~~
+
+Now that we prepared the data, we can use the multiLogReg function. First, we will train the model on our
+training data. Afterward, we can make predictions on the test data and assess the performance of the model.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import multiLogReg
+    betas = multiLogReg(X, Y)
+
+Note that nothing has been calculated yet. In SystemDS, the calculation is only executed once compute() is called on a result, for example:
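+
+.. code-block:: python
+
+    # Triggers execution of the lazily built operation graph;
+    # the model weights are returned as a local NumPy array.
+    betas_res = betas.compute()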
+
+We can now use the trained model to make predictions on the test data.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import multiLogRegPredict
+    [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt)
+
+The multiLogRegPredict function has three return values:
+    - m, a matrix with the mean probability of correctly classifying each label (not used further in this example)
+    - y_pred, the predictions made using the model
+    - acc, the accuracy achieved by the model (a short usage sketch follows below)
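+
+A minimal sketch of materializing these results (Xt, Yt, and betas as defined above):
+
+.. code-block:: python
+
+    [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt).compute()
+    print("test accuracy: " + str(acc))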
+
+Step 3: Confusion Matrix
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+A confusion matrix is a useful tool to analyze the performance of the model and to obtain a better understanding of
+which classes the model has difficulty separating.
+The confusionMatrix function takes the predicted labels and the true labels. It then returns the confusion matrix
+for the predictions and the confusion matrix averages of each true class.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import confusionMatrix
+    confusion_matrix_abs, _ = confusionMatrix(y_pred, Yt).compute()
+    print(confusion_matrix_abs)
+
+Full Script
+~~~~~~~~~~~
+
+In the full script, some steps are combined to keep the overall script short.
+
+.. code-block:: python
+
+    import numpy as np
+    from systemds.context import SystemDSContext
+    from systemds.examples.tutorials.adult import DataManager
+    from systemds.operator.algorithm import multiLogReg, multiLogRegPredict, confusionMatrix
+
+    sds = SystemDSContext()
+    d = DataManager()
+
+    # limit the sample size
+    train_count = 15000
+    test_count = 5000
+
+    train_data, train_labels, test_data, test_labels = d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+    # Train data
+    X = sds.from_numpy(train_data[:train_count])
+    Y = sds.from_numpy(train_labels[:train_count])
+    Y = Y + 1.0
+
+    # Test data
+    Xt = sds.from_numpy(test_data[:test_count])
+    Yt = sds.from_numpy(test_labels[:test_count])
+    Yt = Yt + 1.0
+
+    betas = multiLogReg(X, Y)
+    [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt)
+
+    confusion_matrix_abs, _ = confusionMatrix(y_pred, Yt).compute()
+    print(confusion_matrix_abs)
+
+Level 2
+-------
+
+This part of the tutorial gives an overview of the preprocessing capabilities that SystemDS has to offer.
+We will take an unprocessed dataset in CSV format, read it with SystemDS, and then let SystemDS do the heavy lifting of the preprocessing.
+As mentioned before, we want to use the Adult dataset for this task.
+If one wants to skip the explanation, then the full script is available at the end of this level.
+
+Step 1: Metadata and reading
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+First of all, we need to download the dataset and create a mtd-file for specifying different metadata about the dataset.

Review comment:
       specifying the metadata of the dataset

##########
File path: src/main/python/tests/examples/tutorials/preprocess.dml
##########
@@ -0,0 +1,46 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+get_X = function(matrix[double] X,
+                 int start, int stop)
+    return (matrix[double] returnVal) {
+  returnVal = X[start:stop,1:ncol(X)-1]
+}
+get_Y = function(matrix[double] X,
+                 int start, int stop)
+    return (matrix[double] returnVal) {
+  returnVal = X[start:stop,ncol(X):ncol(X)]
+}
+
+replace_value = function(matrix[double] X,
+                 double pattern , double replacement)
+    return (matrix[double] returnVal) {
+  returnVal = replace(target=X, pattern=pattern, replacement=replacement)
+}
+
+#replace_target_frame = function(String replacement, String to_replace, Frame[Unknown] X)

Review comment:
       remove commented function

##########
File path: src/main/python/tests/examples/tutorials/test_adult.py
##########
@@ -0,0 +1,324 @@
+# -------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+# -------------------------------------------------------------
+import os
+import unittest
+
+import numpy as np
+from systemds.context import SystemDSContext
+from systemds.examples.tutorials.adult import DataManager
+from systemds.operator import OperationNode, Matrix, Frame
+from systemds.operator.algorithm import kmeans, multiLogReg, multiLogRegPredict, l2svm, confusionMatrix, scale, scaleApply, split, winsorize
+from systemds.script_building import DMLScript
+
+
+class Test_DMLScript(unittest.TestCase):
+    """
+    Test class for adult dml script tutorial code.
+    """
+
+    sds: SystemDSContext = None
+    d: DataManager = None
+    neural_net_src_path: str = "tests/examples/tutorials/neural_net_source.dml"
+    preprocess_src_path: str = "tests/examples/tutorials/preprocess.dml"
+    dataset_path_train: str = "../../test/resources/datasets/adult/train_data.csv"
+    dataset_path_train_mtd: str = "../../test/resources/datasets/adult/train_data.csv.mtd"
+    dataset_path_test: str = "../../test/resources/datasets/adult/test_data.csv"
+    dataset_path_test_mtd: str = "../../test/resources/datasets/adult/test_data.csv.mtd"
+    dataset_jspec: str = "../../test/resources/datasets/adult/jspec.json"
+
+    @classmethod
+    def setUpClass(cls):
+        cls.sds = SystemDSContext()
+        cls.d = DataManager()
+
+    @classmethod
+    def tearDownClass(cls):
+        cls.sds.close()
+
+    def test_train_data(self):
+        x = self.d.get_train_data()
+        self.assertEqual((32561, 14), x.shape)
+
+    def test_train_labels(self):
+        y = self.d.get_train_labels()
+        self.assertEqual((32561,), y.shape)
+
+    def test_test_data(self):
+        x_l = self.d.get_test_data()
+        self.assertEqual((16281, 14), x_l.shape)
+
+    def test_test_labels(self):
+        y_l = self.d.get_test_labels()
+        self.assertEqual((16281,), y_l.shape)
+
+    def test_preprocess(self):
+        #assumes certain preprocessing
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset()
+        self.assertEqual((30162,104), train_data.shape)
+        self.assertEqual((30162, ), train_labels.shape)
+        self.assertEqual((15060,104), test_data.shape)
+        self.assertEqual((15060, ), test_labels.shape)
+
+    def test_multi_log_reg(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset()
+
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+        Y = Y + 1.0
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+        Yt = Yt + 1.0
+
+        betas = multiLogReg(X, Y)
+
+        [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt).compute()
+
+        self.assertGreater(acc, 80)
+
+        confusion_matrix_abs, _ = confusionMatrix(self.sds.from_numpy(y_pred), Yt).compute()
+
+        self.assertTrue(
+            np.allclose(
+                confusion_matrix_abs,
+                np.array([[3503, 503],
+                          [268, 726]])
+            )
+        )
+
+    def test_neural_net(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+
+        FFN_package = self.sds.source(self.neural_net_src_path, "fnn", print_imported_methods=True)
+
+        network = FFN_package.train(X, Y, 1, 16, 0.01, 1)
+
+        self.assertTrue(network is not None) # sourcing and training seems to work
+
+        FFN_package.save_model(network, '"model/python_FFN/"').compute(verbose=True)
+
+        # TODO This does not work yet, not sure what the problem is
+        #probs = FFN_package.predict(Xt, network).compute(True)
+        # FFN_package.eval(Yt, Yt).compute()
+
+
+
+    def test_level1(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset(interpolate=True,
+                                                                                           standardize=True, dimred=0.1)
+        # Train data
+        X = self.sds.from_numpy(train_data[:train_count])
+        Y = self.sds.from_numpy(train_labels[:train_count])
+        Y = Y + 1.0
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+        Yt = Yt + 1.0
+
+        betas = multiLogReg(X, Y)
+
+        [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt).compute()
+        self.assertGreater(acc, 80) #Todo remove?
+        # todo add text how high acc should be with this config
+
+        confusion_matrix_abs, _ = confusionMatrix(self.sds.from_numpy(y_pred), Yt).compute()
+        # todo print confusion matrix? Explain cm?
+        self.assertTrue(
+            np.allclose(
+                confusion_matrix_abs,
+                np.array([[3583, 502],
+                          [245, 670]])
+            )
+        )
+
+    def test_level2(self):
+
+        train_count = 32561
+        test_count = 16281
+
+        SCHEMA = '"DOUBLE,STRING,DOUBLE,STRING,DOUBLE,STRING,STRING,STRING,STRING,STRING,DOUBLE,DOUBLE,DOUBLE,STRING,STRING"'
+
+        F1 = self.sds.read(
+            self.dataset_path_train,
+            schema=SCHEMA
+        )
+        F2 = self.sds.read(
+            self.dataset_path_test,
+            schema=SCHEMA
+        )
+
+        jspec = self.sds.read(self.dataset_jspec, data_type="scalar", value_type="string")
+        PREPROCESS_package = self.sds.source(self.preprocess_src_path, "preprocess", print_imported_methods=True)
+
+        X1 = F1.rbind(F2)
+        X1, M1 = X1.transform_encode(spec=jspec)
+
+        X = PREPROCESS_package.get_X(X1, 1, train_count)
+        Y = PREPROCESS_package.get_Y(X1, 1, train_count)
+
+        Xt = PREPROCESS_package.get_X(X1, train_count, train_count+test_count)
+        Yt = PREPROCESS_package.get_Y(X1, train_count, train_count+test_count)
+
+        Yt = PREPROCESS_package.replace_value(Yt, 3.0, 1.0)
+        Yt = PREPROCESS_package.replace_value(Yt, 4.0, 2.0)
+
+        # better alternative for encoding. This was intended, but it does not work

Review comment:
       still the case?

##########
File path: src/main/python/systemds/examples/tutorials/adult.py
##########
@@ -0,0 +1,162 @@
+# -------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+# -------------------------------------------------------------
+
+import array
+import functools
+import gzip
+import operator
+import os
+import struct
+
+import numpy as np
+import pandas as pd
+import requests
+
+class DataManager:
+
+    _train_data_url: str
+    _train_labels_url: str
+    _test_data_url: str
+    _test_labels_url: str
+
+    _train_data_loc: str
+    _train_labels_loc: str
+    _test_data_loc: str
+    _test_labels_loc: str
+
+    _data_columns: []
+    _data_string_labels: []
+
+    def __init__(self):
+        self._train_data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
+        self._test_data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test"
+
+        self._train_data_loc = "systemds/examples/tutorials/adult/train_data.csv"
+        self._test_data_loc = "systemds/examples/tutorials/adult/test_data.csv"
+
+        self._data_columns = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation",
+                   "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country",
+                   "income"]
+
+        self._classification_features_labels = [{'workclass': ['Private', 'Self-emp-not-inc', 'Self-emp-inc', 'Federal-gov', 'Local-gov', 'State-gov', 'Without-pay', 'Never-worked']},
+                                    {'education': ['Bachelors', 'Some-college', '11th', 'HS-grad', 'Prof-school', 'Assoc-acdm', 'Assoc-voc', '9th', '7th-8th', '12th', 'Masters', '1st-4th', '10th', 'Doctorate', '5th-6th', 'Preschool']},
+                                    {'marital-status': ['Married-civ-spouse', 'Divorced', 'Never-married', 'Separated', 'Widowed', 'Married-spouse-absent', 'Married-AF-spouse']},
+                                    {'occupation': ['Tech-support', 'Craft-repair', 'Other-service', 'Sales', 'Exec-managerial', 'Prof-specialty', 'Handlers-cleaners', 'Machine-op-inspct', 'Adm-clerical', 'Farming-fishing', 'Transport-moving', 'Priv-house-serv', 'Protective-serv', 'Armed-Forces']},
+                                    {'relationship': ['Wife', 'Own-child', 'Husband', 'Not-in-family', 'Other-relative', 'Unmarried']},
+                                    {'race': ['White', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo', 'Other', 'Black']},
+                                    {'sex': ['Female', 'Male']},
+                                    {'native-country': ['United-States', 'Cambodia', 'England', 'Puerto-Rico', 'Canada', 'Germany', 'Outlying-US(Guam-USVI-etc)', 'India', 'Japan', 'Greece', 'South', 'China', 'Cuba', 'Iran', 'Honduras', 'Philippines', 'Italy', 'Poland', 'Jamaica', 'Vietnam', 'Mexico', 'Portugal', 'Ireland', 'France', 'Dominican-Republic', 'Laos', 'Ecuador', 'Taiwan', 'Haiti', 'Columbia', 'Hungary', 'Guatemala', 'Nicaragua', 'Scotland', 'Thailand', 'Yugoslavia', 'El-Salvador', 'Trinadad&Tobago', 'Peru', 'Hong', 'Holand-Netherlands']},
+                                    {'income': ['>50K', '<=50K']}]
+
+
+

Review comment:
       some formatting needed

##########
File path: src/test/resources/datasets/adult/jspec.json
##########
@@ -0,0 +1,24 @@
+{
+    "impute":
+    [ { "name": "age", "method": "global_mean" }
+     ,{ "name": "workclass" , "method": "global_mode" }
+     ,{ "name": "fnlwgt", "method": "global_mean" }
+     ,{ "name": "education", "method": "global_mode"  }
+     ,{ "name": "education-num", "method": "global_mean" }
+     ,{ "name": "marital-status"      , "method": "global_mode" }
+     ,{ "name": "occupation"        , "method": "global_mode" }
+     ,{ "name": "relationship" , "method": "global_mode" }
+     ,{ "name": "race"        , "method": "global_mode" }
+     ,{ "name": "sex"        , "method": "global_mode" }
+     ,{ "name": "capital-gain", "method": "global_mean" }
+     ,{ "name": "capital-loss", "method": "global_mean" }
+     ,{ "name": "hours-per-week", "method": "global_mean" }
+     ,{ "name": "native-country"        , "method": "global_mode" }
+    ],
+    "bin":
+    [ { "name": "age"  , "method": "equi-width", "numbins": 3 }],
+   "dummycode": ["age", "workclass", "education", "marital-status", "occupation", "relationship", "race", "sex", "native-country"],
+   "recode": ["income"]
+
+
+}

Review comment:
       formatting

##########
File path: src/main/python/docs/source/guide/python_end_to_end_tut.rst
##########
@@ -0,0 +1,561 @@
+.. -------------------------------------------------------------
+.. 
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+.. 
+..   http://www.apache.org/licenses/LICENSE-2.0
+.. 
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+.. 
+.. ------------------------------------------------------------
+
+Python end-to-end tutorial
+==========================
+
+The goal of this tutorial is to showcase different features of the SystemDS framework that can be accessed with the Python API.
+For this, we want to use the `Adult <https://archive.ics.uci.edu/ml/datasets/adult/>`_ dataset and predict whether the income of a person exceeds $50K/yr based on census data.
+The Adult dataset contains attributes like age, workclass, education, marital-status, occupation, race, [...] and the labels >50K or <=50K.
+Most of these features are categorical string values, but the dataset also includes continuous features.
+For this, we define three different levels with an increasing level of detail with regard to the features provided by SystemDS.
+In the first level, we simply get an already preprocessed dataset from a DataManager.
+The second level shows the built-in preprocessing capabilities of SystemDS.
+With the third level, we want to show how we can integrate custom-built networks or algorithms into our Python program.
+
+Prerequisite: 
+
+- :doc:`/getting_started/install`
+
+Level 1
+-------
+
+This example shows how one can work with NumPy data within the SystemDS framework. More precisely, we will make use of the
+built-in DataManager, Multinomial Logistic Regression function, and the Confusion Matrix function. The dataset used in this
+tutorial is a preprocessed version of the "UCI Adult Data Set". If you are interested in data preprocessing, take a look at level 2.
+If one wants to skip the explanation, the full script is available at the end of this level.
+
+We will train a Multinomial Logistic Regression model on the training dataset and subsequently we will use the test dataset
+to assess how well our model can predict if the income is above or below $50K/yr based on the features.
+
+Step 1: Load and prepare data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+First, we get our training and testing data from the built-in DataManager. Since the multiLogReg function requires the
+labels (Y) to be > 0, we add 1 to all labels. This ensures that the smallest label is >= 1. Additionally, we only take
+a fraction of the training and test set into account to speed up the execution.
+
+.. code-block:: python
+
+    from systemds.context import SystemDSContext
+    from systemds.examples.tutorials.adult import DataManager
+
+    sds = SystemDSContext()
+    d = DataManager()
+
+    # limit the sample size
+    train_count = 15000
+    test_count = 5000
+
+    train_data, train_labels, test_data, test_labels = d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+    # Train data
+    X = sds.from_numpy(train_data[:train_count])
+    Y = sds.from_numpy(train_labels[:train_count])
+    Y = Y + 1.0
+
+    # Test data
+    Xt = sds.from_numpy(test_data[:test_count])
+    Yt = sds.from_numpy(test_labels[:test_count])
+    Yt = Yt + 1.0
+
+Here the DataManager contains the code for downloading the dataset and setting up the NumPy arrays containing the data.
+It is noteworthy that the function get_preprocessed_dataset has options for basic standardization, interpolation, and for combining infrequent categorical values of a column (those whose occurrences fall below a given threshold) into a single category.
+
+Step 2: Training
+~~~~~~~~~~~~~~~~
+
+Now that we prepared the data, we can use the multiLogReg function. First, we will train the model on our
+training data. Afterward, we can make predictions on the test data and assess the performance of the model.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import multiLogReg
+    betas = multiLogReg(X, Y)
+
+Note that nothing has been calculated yet. In SystemDS, the calculation is only executed once compute() is called,
+e.g. betas_res = betas.compute().
+
+We can now use the trained model to make predictions on the test data.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import multiLogRegPredict
+    [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt)
+
+The multiLogRegPredict function has three return values:
+    - m, a matrix with the mean probability of correctly classifying each label (not used further in this example)
+    - y_pred, the predictions made using the model
+    - acc, the accuracy achieved by the model
+
+Step 3: Confusion Matrix
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+A confusion matrix is a useful tool to analyze the performance of the model and to obtain a better understanding of
+which classes the model has difficulty separating.
+The confusionMatrix function takes the predicted labels and the true labels. It then returns the confusion matrix
+for the predictions and the confusion matrix averages of each true class.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import confusionMatrix
+    confusion_matrix_abs, _ = confusionMatrix(y_pred, Yt).compute()
+    print(confusion_matrix_abs)
+
+Full Script
+~~~~~~~~~~~
+
+In the full script, some steps are combined to keep the overall script short.
+
+.. code-block:: python
+
+    import numpy as np
+    from systemds.context import SystemDSContext
+    from systemds.examples.tutorials.adult import DataManager
+    from systemds.operator.algorithm import multiLogReg, multiLogRegPredict, confusionMatrix
+
+    sds = SystemDSContext()
+    d = DataManager()
+
+    # limit the sample size
+    train_count = 15000
+    test_count = 5000
+
+    train_data, train_labels, test_data, test_labels = d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+    # Train data
+    X = sds.from_numpy(train_data[:train_count])
+    Y = sds.from_numpy(train_labels[:train_count])
+    Y = Y + 1.0
+
+    # Test data
+    Xt = sds.from_numpy(test_data[:test_count])
+    Yt = sds.from_numpy(test_labels[:test_count])
+    Yt = Yt + 1.0
+
+    betas = multiLogReg(X, Y)
+    [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt)
+
+    confusion_matrix_abs, _ = confusionMatrix(y_pred, Yt).compute()
+    print(confusion_matrix_abs)
+
+Level 2
+-------
+
+This part of the tutorial gives an overview of the preprocessing capabilities that SystemDS has to offer.
+We will take an unprocessed dataset in CSV format, read it with SystemDS, and then let SystemDS do the heavy lifting of the preprocessing.
+As mentioned before, we want to use the Adult dataset for this task.
+If one wants to skip the explanation, then the full script is available at the end of this level.
+
+Step 1: Metadata and reading
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+First of all, we need to download the dataset and create a mtd-file for specifying different metadata about the dataset.
+We download the train and test dataset from: https://archive.ics.uci.edu/ml/datasets/adult
+
+The downloaded dataset will be slightly modified for convenience. These modifications entail removing unnecessary newlines at the end of the files and
+adding column names at the top of the files such that the first line looks like:
+
+.. code-block::
+
+    age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
+
+We also delete the line in the test dataset that holds the string value "|1x3 Cross validator".
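+
+One possible way to apply this cleanup with plain Python (the file path is the one used later in this tutorial):
+
+.. code-block:: python
+
+    # Drop the marker line from the raw test file before reading it with SystemDS
+    with open("adult/test_data.csv") as f:
+        lines = [line for line in f if "|1x3 Cross validator" not in line]
+    with open("adult/test_data.csv", "w") as f:
+        f.writelines(lines)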
+
+After these modifications, we have to define an mtd file for each file we want to read. This mtd file has to be in the same directory as the dataset.
+In this particular example, the dataset is split into two files, "train_data.csv" and "test_data.csv". We want to read both, which means that we will define an mtd file for
+each of them. Those files will be called "train_data.csv.mtd" and "test_data.csv.mtd".
+In these files, we can define certain properties of the file and also specify which values should be treated as missing values.
+
+The content of the train_data.csv.mtd file is:
+
+.. code-block::
+
+    {
+    "data_type": "frame",
+    "format": "csv",
+    "header": true,
+    "naStrings": [ "?", "" ],
+    "rows": 32561,
+    "cols": 15
+    }
+
+The "format" of the file is csv, and "header" is set to true because we added the feature names as headers to the csv files.
+The value "data_type" is set to frame, as the preprocessing functions that we use require this datatype.
+The value of "naStrings" is a list of all the string values that should be treated as unknown values during the preprocessing.
+Also, "rows" is set to 32561, as the training file has this many entries, and "cols" is set to 15, as the files contain 14 feature columns and one label column. We will later show how we can split them.
+
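+The test_data.csv.mtd file follows the same pattern; assuming the test file is kept unchanged apart from the modifications above, only the row count differs:
+
+.. code-block::
+
+    {
+    "data_type": "frame",
+    "format": "csv",
+    "header": true,
+    "naStrings": [ "?", "" ],
+    "rows": 16281,
+    "cols": 15
+    }
+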
+After these requirements are completed, we have to define a SystemDSContext for reading our dataset. We can do this in the following way:
+
+.. code-block:: python
+
+    sds = SystemDSContext()
+
+    train_count = 32561
+    test_count = 16281
+
+With this context we can now define a read operation using the path of the dataset and a schema.
+The schema simply defines the data types for each column.
+
+As already mentioned, SystemDS supports lazy execution by default, which means that the read operation is only executed after calling the compute() function.
+
+.. code-block:: python
+
+    SCHEMA = '"DOUBLE,STRING,DOUBLE,STRING,DOUBLE,STRING,STRING,STRING,STRING,STRING,DOUBLE,DOUBLE,DOUBLE,STRING,STRING"'
+
+    dataset_path_train = "adult/train_data.csv"
+    dataset_path_test = "adult/test_data.csv"
+
+    F1 = sds.read(
+        dataset_path_train,
+        schema=SCHEMA
+    )
+    F2 = sds.read(
+        dataset_path_test,
+        schema=SCHEMA
+    )
+
+Step 2: Defining preprocess operations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Now that the read operation has been declared, we can define an additional file for the further preprocessing of the dataset.
+For this, we create a .json file that holds information about the operations that will be performed on individual columns.
+For the sake of this tutorial we will use the file "jspec.json" with the following content:
+
+.. code-block::
+
+    {
+    "impute":
+    [ { "name": "age", "method": "global_mean" }
+     ,{ "name": "workclass" , "method": "global_mode" }
+     ,{ "name": "fnlwgt", "method": "global_mean" }
+     ,{ "name": "education", "method": "global_mode"  }
+     ,{ "name": "education-num", "method": "global_mean" }
+     ,{ "name": "marital-status"      , "method": "global_mode" }
+     ,{ "name": "occupation"        , "method": "global_mode" }
+     ,{ "name": "relationship" , "method": "global_mode" }
+     ,{ "name": "race"        , "method": "global_mode" }
+     ,{ "name": "sex"        , "method": "global_mode" }
+     ,{ "name": "capital-gain", "method": "global_mean" }
+     ,{ "name": "capital-loss", "method": "global_mean" }
+     ,{ "name": "hours-per-week", "method": "global_mean" }
+     ,{ "name": "native-country"        , "method": "global_mode" }
+    ],
+    "bin": [ { "name": "age"  , "method": "equi-width", "numbins": 3 }],
+    "dummycode": ["age", "workclass", "education", "marital-status", "occupation", "relationship", "race", "sex", "native-country"],
+    "recode": ["income"]
+    }
+
+Our dataset has missing values. An easy way to deal with this is the "impute" option that SystemDS supports.
+We simply pass a list that maps column names to the imputation method. A more concrete example is the "education" column.
+In the dataset certain entries have missing values for this column. As this is a string feature,
+we can simply define the method as "global_mode" and replace every missing value with the global mode inside this column. It is important to note that
+we first had to declare which strings denote missing values in the .mtd files ("naStrings": [ "?", "" ]).
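+
+As a purely illustrative (hypothetical) example, a row such as
+
+.. code-block::
+
+    39,?,77516,Bachelors,...
+
+would have its missing workclass value ("?") replaced by the most frequent workclass value in that column.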
+
+With the "bin" keyword we can discretize continuous values into a small number of bins. Here the column with age values
+is discretized into three age intervals. The only method that is currently supported is equi-width binning.
+
+The column-level data transformation "dummycode" allows us to one-hot-encode a column.
+In our example we first bin the "age" column into 3 different bins. This means that we now have one column where one entry can belong to one of 3 age groups. After using
+"dummycode", we transform this one column into 3 different columns, one for each bin.
+
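+A purely illustrative sketch (hypothetical column names and values) of what this means for the binned age column:
+
+.. code-block::
+
+    age (binned)        age_bin_1  age_bin_2  age_bin_3
+    2              -->  0          1          0
+    1              -->  1          0          0
+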
+At last, we make use of the "recode" transformation for categorical columns; it maps all distinct categories in
+the column to consecutive numbers, starting from 1. In our example we recode the "income" column, which
+transforms it from "<=50K" and ">50K" to "1" and "2" respectively.
+
+Another good resource for further ways of processing is: https://apache.github.io/systemds/site/dml-language-reference.html
+
+There we provide different examples for defining jspecs and describe what functionality is currently supported.
+
+After defining the .jspec file we can read it by passing the filepath, data_type and value_type using the following command:
+
+.. code-block:: python
+
+    dataset_jspec = "adult/jspec.json"
+    jspec = sds.read(dataset_jspec, data_type="scalar", value_type="string")
+
+Finally, we need to define a custom dml file to split the features from the labels and replace certain values, which we will need later.
+We will call this file "preprocess.dml":
+
+.. code-block::
+
+    get_X = function(matrix[double] X, int start, int stop)
+        return (matrix[double] returnVal) {
+      returnVal = X[start:stop,1:ncol(X)-1]
+    }
+
+    get_Y = function(matrix[double] X, int start, int stop)
+        return (matrix[double] returnVal) {
+      returnVal = X[start:stop,ncol(X):ncol(X)]
+    }
+
+    replace_value = function(matrix[double] X, double pattern, double replacement)
+        return (matrix[double] returnVal) {
+      returnVal = replace(target=X, pattern=pattern, replacement=replacement)
+    }
+
+The get_X function simply extracts every column except the last one and can also be used to pick certain row slices from the dataset.
+The get_Y function only extracts the last column, which in our case holds the labels. The replace_value function replaces one double value with another.
+The preprocess.dml file can be read with the following command:
+
+.. code-block:: python
+
+    preprocess_src_path = "preprocess.dml"
+    PREPROCESS_package = sds.source(preprocess_src_path, "preprocess", print_imported_methods=True)
+
+The print_imported_methods flag can be used to verify whether every method has been parsed correctly.
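+
+As a rough sketch (X1 here denotes the encoded matrix produced in the next step, train_count as defined above), the sourced functions can later be invoked like regular operations:
+
+.. code-block:: python
+
+    X = PREPROCESS_package.get_X(X1, 1, train_count)
+    Y = PREPROCESS_package.get_Y(X1, 1, train_count)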
+
+Step 3: Applying the preprocessing steps
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Generally speaking, we would use the transform_encode function on the train dataset and then, with the returned encoding, call the transform_apply function on the test dataset.
+In the case of the Adult dataset, the label names inside the test dataset and the train dataset are inconsistent, which is why we will show how we can deal with that using SystemDS.
+First of all, we combine the train and the test dataset by using the rbind() function. This function simply appends the Frame F2 at the end of Frame F1.
+This is necessary to ensure that the encoding is identical between train and test dataset.

Review comment:
       The encoding should be extracted from the training data and then applied to the test data; this is why we have two different outputs from transform_encode. You can also argue that you need to save the M to be able to produce predictions on new data, since you need to apply the same preprocessing on unseen data.
   
   If I remember correctly, the code is something like:
   
   X, M = X.transform_encode(spec=jspec)
   X_test = X_test.transform_apply(M)
   
   

##########
File path: src/main/python/tests/examples/tutorials/test_adult.py
##########
@@ -0,0 +1,324 @@
+# -------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+# -------------------------------------------------------------
+import os
+import unittest
+
+import numpy as np
+from systemds.context import SystemDSContext
+from systemds.examples.tutorials.adult import DataManager
+from systemds.operator import OperationNode, Matrix, Frame
+from systemds.operator.algorithm import kmeans, multiLogReg, multiLogRegPredict, l2svm, confusionMatrix, scale, scaleApply, split, winsorize
+from systemds.script_building import DMLScript
+
+
+class Test_DMLScript(unittest.TestCase):
+    """
+    Test class for adult dml script tutorial code.
+    """
+
+    sds: SystemDSContext = None
+    d: DataManager = None
+    neural_net_src_path: str = "tests/examples/tutorials/neural_net_source.dml"
+    preprocess_src_path: str = "tests/examples/tutorials/preprocess.dml"
+    dataset_path_train: str = "../../test/resources/datasets/adult/train_data.csv"
+    dataset_path_train_mtd: str = "../../test/resources/datasets/adult/train_data.csv.mtd"
+    dataset_path_test: str = "../../test/resources/datasets/adult/test_data.csv"
+    dataset_path_test_mtd: str = "../../test/resources/datasets/adult/test_data.csv.mtd"
+    dataset_jspec: str = "../../test/resources/datasets/adult/jspec.json"
+
+    @classmethod
+    def setUpClass(cls):
+        cls.sds = SystemDSContext()
+        cls.d = DataManager()
+
+    @classmethod
+    def tearDownClass(cls):
+        cls.sds.close()
+
+    def test_train_data(self):
+        x = self.d.get_train_data()
+        self.assertEqual((32561, 14), x.shape)
+
+    def test_train_labels(self):
+        y = self.d.get_train_labels()
+        self.assertEqual((32561,), y.shape)
+
+    def test_test_data(self):
+        x_l = self.d.get_test_data()
+        self.assertEqual((16281, 14), x_l.shape)
+
+    def test_test_labels(self):
+        y_l = self.d.get_test_labels()
+        self.assertEqual((16281,), y_l.shape)
+
+    def test_preprocess(self):
+        #assumes certain preprocessing
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset()
+        self.assertEqual((30162,104), train_data.shape)
+        self.assertEqual((30162, ), train_labels.shape)
+        self.assertEqual((15060,104), test_data.shape)
+        self.assertEqual((15060, ), test_labels.shape)
+
+    def test_multi_log_reg(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset()
+
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+        Y = Y + 1.0
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+        Yt = Yt + 1.0
+
+        betas = multiLogReg(X, Y)
+
+        [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt).compute()
+
+        self.assertGreater(acc, 80)
+
+        confusion_matrix_abs, _ = confusionMatrix(self.sds.from_numpy(y_pred), Yt).compute()
+
+        self.assertTrue(
+            np.allclose(
+                confusion_matrix_abs,
+                np.array([[3503, 503],
+                          [268, 726]])
+            )
+        )
+
+    def test_neural_net(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+
+        FFN_package = self.sds.source(self.neural_net_src_path, "fnn", print_imported_methods=True)
+
+        network = FFN_package.train(X, Y, 1, 16, 0.01, 1)
+
+        self.assertTrue(type(network) is not None) # sourcing and training seem to work
+
+        FFN_package.save_model(network, '"model/python_FFN/"').compute(verbose=True)
+
+        # TODO This does not work yet, not sure what the problem is
+        #probs = FFN_package.predict(Xt, network).compute(True)
+        # FFN_package.eval(Yt, Yt).compute()
+
+
+
+    def test_level1(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset(interpolate=True,
+                                                                                           standardize=True, dimred=0.1)
+        # Train data
+        X = self.sds.from_numpy(train_data[:train_count])
+        Y = self.sds.from_numpy(train_labels[:train_count])
+        Y = Y + 1.0
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+        Yt = Yt + 1.0
+
+        betas = multiLogReg(X, Y)
+
+        [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt).compute()
+        self.assertGreater(acc, 80) #Todo remove?
+        # todo add text how high acc should be with this config
+
+        confusion_matrix_abs, _ = confusionMatrix(self.sds.from_numpy(y_pred), Yt).compute()
+        # todo print confusion matrix? Explain cm?
+        self.assertTrue(
+            np.allclose(
+                confusion_matrix_abs,
+                np.array([[3583, 502],
+                          [245, 670]])
+            )
+        )
+
+    def test_level2(self):
+
+        train_count = 32561
+        test_count = 16281
+
+        SCHEMA = '"DOUBLE,STRING,DOUBLE,STRING,DOUBLE,STRING,STRING,STRING,STRING,STRING,DOUBLE,DOUBLE,DOUBLE,STRING,STRING"'
+
+        F1 = self.sds.read(
+            self.dataset_path_train,
+            schema=SCHEMA
+        )
+        F2 = self.sds.read(
+            self.dataset_path_test,
+            schema=SCHEMA
+        )
+
+        jspec = self.sds.read(self.dataset_jspec, data_type="scalar", value_type="string")
+        PREPROCESS_package = self.sds.source(self.preprocess_src_path, "preprocess", print_imported_methods=True)
+
+        X1 = F1.rbind(F2)
+        X1, M1 = X1.transform_encode(spec=jspec)
+
+        X = PREPROCESS_package.get_X(X1, 1, train_count)
+        Y = PREPROCESS_package.get_Y(X1, 1, train_count)
+
+        Xt = PREPROCESS_package.get_X(X1, train_count, train_count+test_count)
+        Yt = PREPROCESS_package.get_Y(X1, train_count, train_count+test_count)
+
+        Yt = PREPROCESS_package.replace_value(Yt, 3.0, 1.0)
+        Yt = PREPROCESS_package.replace_value(Yt, 4.0, 2.0)
+
+        # better alternative for encoding. This was intended, but it does not work
+        #F2 = F2.replace("<=50K.", "<=50K")
+        #F2 = F2.replace(">50K.", ">50K")
+        #X1, M = F1.transform_encode(spec=jspec)
+        #X2 = F2.transform_apply(spec=jspec, meta=M)
+
+        #X = PREPROCESS_package.get_X(X1, 1, train_count)
+        #Y = PREPROCESS_package.get_Y(X1, 1, train_count)
+        #Xt = PREPROCESS_package.get_X(X2, 1, test_count)
+        #Yt = PREPROCESS_package.get_Y(X2, 1, test_count)
+
+        # TODO somehow throws error at predict with this included
+        #X, mean, sigma = scale(X, True, True)
+        #Xt = scaleApply(Xt, mean, sigma)
+
+        betas = multiLogReg(X, Y)
+
+        [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt)
+
+        confusion_matrix_abs, _ = confusionMatrix(y_pred, Yt).compute()
+        print(confusion_matrix_abs)
+        self.assertTrue(
+            np.allclose(
+                confusion_matrix_abs,
+                np.array([[11593.,  1545.],
+                          [842., 2302.]])
+            )
+        )
+
+    def test_level3(self):
+        train_count = 32561
+        test_count = 16281
+
+        SCHEMA = '"DOUBLE,STRING,DOUBLE,STRING,DOUBLE,STRING,STRING,STRING,STRING,STRING,DOUBLE,DOUBLE,DOUBLE,STRING,STRING"'
+
+        F1 = self.sds.read(
+            self.dataset_path_train,
+            schema=SCHEMA
+        )
+        F2 = self.sds.read(
+            self.dataset_path_test,
+            schema=SCHEMA
+        )
+
+        jspec = self.sds.read(self.dataset_jspec, data_type="scalar", value_type="string")
+        PREPROCESS_package = self.sds.source(self.preprocess_src_path, "preprocess", print_imported_methods=True)
+
+        X1 = F1.rbind(F2)
+        X1, M1 = X1.transform_encode(spec=jspec)
+
+        X = PREPROCESS_package.get_X(X1, 1, train_count)
+        Y = PREPROCESS_package.get_Y(X1, 1, train_count)
+
+        Xt = PREPROCESS_package.get_X(X1, train_count, train_count + test_count)
+        Yt = PREPROCESS_package.get_Y(X1, train_count, train_count + test_count)
+
+        Yt = PREPROCESS_package.replace_value(Yt, 3.0, 1.0)
+        Yt = PREPROCESS_package.replace_value(Yt, 4.0, 2.0)
+
+        # better alternative for encoding
+        # F2 = F2.replace("<=50K.", "<=50K")
+        # F2 = F2.replace(">50K.", ">50K")
+        # X1, M = F1.transform_encode(spec=jspec)
+        # X2 = F2.transform_apply(spec=jspec, meta=M)
+
+        # X = PREPROCESS_package.get_X(X1, 1, train_count)
+        # Y = PREPROCESS_package.get_Y(X1, 1, train_count)
+        # Xt = PREPROCESS_package.get_X(X2, 1, test_count)
+        # Yt = PREPROCESS_package.get_Y(X2, 1, test_count)
+
+        # TODO somehow throws error at predict with this included
+        # X, mean, sigma = scale(X, True, True)
+        # Xt = scaleApply(Xt, mean, sigma)
+
+        FFN_package = self.sds.source(self.neural_net_src_path, "fnn", print_imported_methods=True)
+
+        epochs = 1
+        batch_size = 16
+        learning_rate = 0.01
+        seed = 42
+
+        network = FFN_package.train(X, Y, epochs, batch_size, learning_rate, seed)
+
+        """
+        If more resources are available, one can also choose to train the model using a parameter server.
+        Here we use the same parameters as before; however, we need to specify a few more.
+        """
+        ################################################################################################################
+        # workers = 1
+        # utype = '"BSP"'
+        # freq = '"EPOCH"'
+        # mode = '"LOCAL"'
+        # network = FFN_package.train_paramserv(X, Y, epochs,
+        #                                       batch_size, learning_rate, workers, utype, freq, mode,
+        #                                       seed)
+        ################################################################################################################
+
+        FFN_package.save_model(network, '"model/python_FFN/"').compute(verbose=True)
+
+        """
+        Next we evaluate our network on the test set which was not used for training.
+        The predict function with the test features and our trained network returns a matrix of class probabilities.
+        This matrix contains for each test sample the probabilities for each class.
+        For predicting the most likely class of a sample, we choose the class with the highest probability.
+        """
+        ################################################################################################################
+        #probs = FFN_package.predict(Xt, network)

Review comment:
       this was not working, correct?

##########
File path: src/main/python/docs/source/guide/python_end_to_end_tut.rst
##########
@@ -0,0 +1,561 @@
+.. -------------------------------------------------------------
+.. 
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+.. 
+..   http://www.apache.org/licenses/LICENSE-2.0
+.. 
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+.. 
+.. ------------------------------------------------------------
+
+Python end-to-end tutorial
+==========================
+
+The goal of this tutorial is to showcase different features of the SystemDS framework that can be accessed with the Python API.
+For this, we want to use the `Adult <https://archive.ics.uci.edu/ml/datasets/adult/>`_ dataset and predict whether the income of a person exceeds $50K/yr based on census data.
+The Adult dataset contains attributes like age, workclass, education, marital-status, occupation, race, [...] and the labels >50K or <=50K.
+Most of these features are categorical string values, but the dataset also includes continuous features.
+To this end, we define three different levels with increasing detail with regard to the features provided by SystemDS.
+In the first level, we simply get an already preprocessed dataset from a DatasetManager.
+The second level shows the built-in preprocessing capabilities of SystemDS.
+With the third level, we want to show how we can integrate custom-built networks or algorithms into our Python program.
+
+Prerequisite: 
+
+- :doc:`/getting_started/install`
+
+Level 1
+-------
+
+This example shows how one can work with NumPy data within the SystemDS framework. More precisely, we will make use of the
+built-in DataManager, Multinomial Logistic Regression function, and the Confusion Matrix function. The dataset used in this
+tutorial is a preprocessed version of the "UCI Adult Data Set". If you are interested in data preprocessing, take a look at level 2.
+If one wants to skip the explanation then the full script is available at the end of this level.
+
+We will train a Multinomial Logistic Regression model on the training dataset and subsequently we will use the test dataset
+to assess how well our model can predict if the income is above or below $50K/yr based on the features.
+
+Step 1: Load and prepare data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+First, we get our training and testing data from the built-in DataManager. Since the multiLogReg function requires the
+labels (Y) to be > 0, we add 1 to all labels. This ensures that the smallest label is >= 1. Additionally, we will only use
+a fraction of the training and test sets to speed up the execution.
+
+.. code-block:: python
+
+    from systemds.context import SystemDSContext
+    from systemds.examples.tutorials.adult import DataManager
+
+    sds = SystemDSContext()
+    d = DataManager()
+
+    # limit the sample size
+    train_count = 15000
+    test_count = 5000
+
+    train_data, train_labels, test_data, test_labels = d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+    # Train data
+    X = sds.from_numpy(train_data[:train_count])
+    Y = sds.from_numpy(train_labels[:train_count])
+    Y = Y + 1.0
+
+    # Test data
+    Xt = sds.from_numpy(test_data[:test_count])
+    Yt = sds.from_numpy(test_labels[:test_count])
+    Yt = Yt + 1.0
+
+Here the DataManager contains the code for downloading and setting up NumPy arrays containing the data.
+It is noteworthy that the function get_preprocessed_dataset has options for basic standardization, interpolation, and for combining, inside one column, the categorical values whose occurrence counts fall below a certain threshold.
+
+Step 2: Training
+~~~~~~~~~~~~~~~~
+
+Now that we prepared the data, we can use the multiLogReg function. First, we will train the model on our
+training data. Afterward, we can make predictions on the test data and assess the performance of the model.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import multiLogReg
+    betas = multiLogReg(X, Y)
+
+Note that nothing has been calculated yet. In SystemDS the calculation is executed once compute() is called.
+E.g. betas_res = betas.compute().
+
+We can now use the trained model to make predictions on the test data.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import multiLogRegPredict
+    [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt)
+
+The multiLogRegPredict function has three return values:
+    - m, a matrix with the mean probability of correctly classifying each label. We do not use it further in this example.
+    - y_pred, the predictions made using the model
+    - acc, the accuracy achieved by the model.
+
+Step 3: Confusion Matrix
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+A confusion matrix is a useful tool to analyze the performance of the model and to obtain a better understanding
+of which classes the model has difficulty separating.
+The confusionMatrix function takes the predicted labels and the true labels. It then returns the confusion matrix
+for the predictions and the confusion matrix averages of each true class.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import confusionMatrix
+    confusion_matrix_abs, _ = confusionMatrix(y_pred, Yt).compute()
+    print(confusion_matrix_abs)
+
+Full Script
+~~~~~~~~~~~
+
+In the full script, some steps are combined to keep the overall script short.
+
+.. code-block:: python
+
+    import numpy as np
+    from systemds.context import SystemDSContext
+    from systemds.examples.tutorials.adult import DataManager
+    from systemds.operator.algorithm import multiLogReg, multiLogRegPredict, confusionMatrix
+
+    sds = SystemDSContext()
+    d = DataManager()
+
+    # limit the sample size
+    train_count = 15000
+    test_count = 5000
+
+    train_data, train_labels, test_data, test_labels = d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+    # Train data
+    X = sds.from_numpy(train_data[:train_count])
+    Y = sds.from_numpy(train_labels[:train_count])
+    Y = Y + 1.0
+
+    # Test data
+    Xt = sds.from_numpy(test_data[:test_count])
+    Yt = sds.from_numpy(test_labels[:test_count])
+    Yt = Yt + 1.0
+
+    betas = multiLogReg(X, Y)
+    [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt)
+
+    confusion_matrix_abs, _ = confusionMatrix(y_pred, Yt).compute()
+    print(confusion_matrix_abs)
+
+Level 2
+-------
+
+This part of the tutorial shows an overview of the preprocessing capabilities that SystemDS has to offer.
+We will take an unprocessed dataset in csv format, read it with SystemDS, and then do the heavy lifting of the preprocessing with SystemDS.
+As mentioned before, we want to use the Adult dataset for this task.
+If one wants to skip the explanation, then the full script is available at the end of this level.
+
+Step 1: Metadata and reading
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+First of all, we need to download the dataset and create an mtd file for specifying different metadata about the dataset.
+We download the train and test dataset from: https://archive.ics.uci.edu/ml/datasets/adult
+
+The downloaded dataset will be slightly modified for convenience. These modifications entail removing unnecessary newlines at the end of the files and
+adding column names at the top of the files such that the first line looks like:
+
+.. code-block::
+
+    age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
+
+We also delete the line holding the string value |1x3 Cross validator inside the test dataset.
+
+After these modifications, we have to define an mtd file for each file we want to read. This mtd file has to be in the same directory as the dataset.
+In this particular example, the dataset is split into two files "train_data.csv" and "test_data.csv". We want to read both, which means that we will define an mtd file for
+each of them. Those files will be called "train_data.csv.mtd" and "test_data.csv.mtd".
+In these files, we can define certain properties of the file and also specify which values should be treated as missing values.
+
+The content of the train_data.csv.mtd file is:
+
+.. code-block::
+
+    {
+    "data_type": "frame",
+    "format": "csv",
+    "header": true,
+    "naStrings": [ "?", "" ],
+    "rows": 32561,
+    "cols": 15
+    }
+
+The "format" of the file is csv, and "header" is set to true because we added the feature names as headers to the csv files.
+The value "data_type" is set to frame, as the preprocessing functions that we use require this datatype.
+The value of "naStrings" is a list of all the string values that should be treated as unknown values during the preprocessing.
+Also, "rows" in our example is set to 32561, as we have this many entries and "cols" is set to 15 as we have 14 features, and one label column inside the files. We will later show how we can split them.
+
+After these requirements are completed, we have to define a SystemDSContext for reading our dataset. We can do this in the following way:
+
+.. code-block:: python
+
+    sds = SystemDSContext()
+
+    train_count = 32561
+    test_count = 16281
+
+With this context we can now define a read operation using the path of the dataset and a schema.
+The schema simply defines the data types for each column.
+
+As already mentioned, SystemDS supports lazy execution by default, which means that the read operation is only executed after calling the compute() function.
+
+.. code-block:: python
+
+    SCHEMA = '"DOUBLE,STRING,DOUBLE,STRING,DOUBLE,STRING,STRING,STRING,STRING,STRING,DOUBLE,DOUBLE,DOUBLE,STRING,STRING"'
+
+    dataset_path_train = "adult/train_data.csv"
+    dataset_path_test = "adult/test_data.csv"
+
+    F1 = sds.read(
+        dataset_path_train,
+        schema=SCHEMA
+    )
+    F2 = sds.read(
+        dataset_path_test,
+        schema=SCHEMA
+    )
+
+Step 2: Defining preprocess operations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Now that the read operation has been declared, we can define an additional file for the further preprocessing of the dataset.
+For this, we create a .json file that holds information about the operations that will be performed on individual columns.
+For the sake of this tutorial we will use the file "jspec.json" with the following content:
+
+.. code-block::
+
+    {
+    "impute":
+    [ { "name": "age", "method": "global_mean" }
+     ,{ "name": "workclass" , "method": "global_mode" }
+     ,{ "name": "fnlwgt", "method": "global_mean" }
+     ,{ "name": "education", "method": "global_mode"  }
+     ,{ "name": "education-num", "method": "global_mean" }
+     ,{ "name": "marital-status"      , "method": "global_mode" }
+     ,{ "name": "occupation"        , "method": "global_mode" }
+     ,{ "name": "relationship" , "method": "global_mode" }
+     ,{ "name": "race"        , "method": "global_mode" }
+     ,{ "name": "sex"        , "method": "global_mode" }
+     ,{ "name": "capital-gain", "method": "global_mean" }
+     ,{ "name": "capital-loss", "method": "global_mean" }
+     ,{ "name": "hours-per-week", "method": "global_mean" }
+     ,{ "name": "native-country"        , "method": "global_mode" }
+    ],
+    "bin": [ { "name": "age"  , "method": "equi-width", "numbins": 3 }],
+    "dummycode": ["age", "workclass", "education", "marital-status", "occupation", "relationship", "race", "sex", "native-country"],
+    "recode": ["income"]
+    }
+
+Our dataset has missing values. An easy way to deal with this circumstance is to use the "impute" option that SystemDS supports.
+We simply pass a list that maps column names to the imputation method. A more specific example is the "education" column.
+In the dataset, certain entries have missing values for this column. As this is a string feature,
+we can simply define the method as "global_mode" and replace every missing value with the global mode of this column. It is important to note that
+we first had to define the missing-value strings of our selected dataset in the .mtd files ("naStrings": [ "?", "" ]).
+
+With the "bin" keyword we can discretize continuous values into a small number of bins. Here the column with age values
+is discretized into three age intervals. The only method that is currently supported is equi-width binning.
+
+The column-level data transformation "dummycode" allows us to one-hot-encode a column.
+In our example we first bin the "age" column into 3 different bins. This means that we now have one column where one entry can belong to one of 3 age groups. After using
+"dummycode", we transform this one column into 3 different columns, one for each bin.
+
+Finally, we make use of the "recode" transformation for categorical columns; it maps all distinct categories in
+the column to consecutive numbers, starting from 1. In our example we recode the "income" column, which
+transforms it from "<=50K" and ">50K" to "1" and "2" respectively.
+
+Another good resource for further ways of processing is: https://apache.github.io/systemds/site/dml-language-reference.html
+
+There we provide different examples for defining jspecs and an overview of the functionality that is currently supported.
+
+After defining the .jspec file we can read it by passing the filepath, data_type and value_type using the following command:
+
+.. code-block:: python
+
+    dataset_jspec = "adult/jspec.json"
+    jspec = sds.read(dataset_jspec, data_type="scalar", value_type="string")
+
+Finally, we need to define a custom dml file to split the features from the labels and replace certain values, which we will need later.
+We will call this file "preprocess.dml":
+
+.. code-block::
+
+    get_X = function(matrix[double] X, int start, int stop)
+    return (matrix[double] returnVal) {
+    returnVal = X[start:stop,1:ncol(X)-1]
+    }
+    get_Y = function(matrix[double] X, int start, int stop)
+    return (matrix[double] returnVal) {
+    returnVal = X[start:stop,ncol(X):ncol(X)]
+    }
+    replace_value = function(matrix[double] X, double pattern , double replacement)
+    return (matrix[double] returnVal) {
+    returnVal = replace(target=X, pattern=pattern, replacement=replacement)
+    }
+
+The get_X function simply extracts every column except the last one and can also be used to pick certain slices from the dataset.
+The get_Y function only extracts the last column, which in our case holds the labels. The replace_value function is used to replace one double value with another.
+The preprocess.dml file can be read with the following command:
+
+.. code-block:: python
+
+    preprocess_src_path = "preprocess.dml"
+    PREPROCESS_package = sds.source(preprocess_src_path, "preprocess", print_imported_methods=True)
+
+The print_imported_methods flag can be used to verify whether every method has been parsed correctly.
+
+Step 3: Applying the preprocessing steps
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Generally speaking, we would call the transform_encode function on the train dataset and then, with the returned encoding, call the transform_apply function on the test dataset.
+In the case of the Adult dataset, the label names inside the test dataset and the train dataset are inconsistent, which is why we will show how to deal with that using SystemDS.
+First of all, we combine the train and the test dataset by using the rbind() function. This function simply appends the Frame F2 to the end of Frame F1.
+This is necessary to ensure that the encoding is identical for the train and test dataset.
+
+.. code-block:: python
+
+    X1 = F1.rbind(F2)
+
+In order to use our jspec file we can apply the transform_encode() function. We simply have to pass the read .json file from before.
+In our particular case we obtain the Matrix X1 and the Frame M1 from the operation. X1 holds all the encoded values and M1 holds a mapping between the encoded values
+and all the initial values. Columns that have not been specified in the .json file were not altered.
+
+.. code-block:: python
+
+    X1, M1 = X1.transform_encode(spec=jspec)
+
+We can now use the previously parsed dml file for splitting the dataset and unifying the inconsistent labels. It is noteworthy that the
+file is parsed such that we can directly call the function names from the Python API.
+
+.. code-block:: python
+
+    X = PREPROCESS_package.get_X(X1, 1, train_count)
+    Y = PREPROCESS_package.get_Y(X1, 1, train_count)
+
+    Xt = PREPROCESS_package.get_X(X1, train_count, train_count+test_count)
+    Yt = PREPROCESS_package.get_Y(X1, train_count, train_count+test_count)
+
+    Yt = PREPROCESS_package.replace_value(Yt, 3.0, 1.0)
+    Yt = PREPROCESS_package.replace_value(Yt, 4.0, 2.0)
+
+Step 4: Training and confusion matrix
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Now that we have prepared the data, we can use the multiLogReg function.
+These steps are identical to steps 2 and 3, which have already been described in level 1 of this tutorial.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import multiLogReg
+    from systemds.operator.algorithm import confusionMatrix
+    from systemds.operator.algorithm import multiLogRegPredict
+    betas = multiLogReg(X, Y)
+    [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt)
+    confusion_matrix_abs, _ = confusionMatrix(y_pred, Yt).compute()
+    print(confusion_matrix_abs)
+
+Full Script
+~~~~~~~~~~~
+
+The complete script now can be seen here:
+
+.. code-block:: python
+
+    import numpy as np
+    from systemds.context import SystemDSContext
+    from systemds.operator.algorithm import multiLogReg, multiLogRegPredict, confusionMatrix
+
+    train_count = 32561
+    test_count = 16281
+
+    dataset_path_train = "adult/train_data.csv"
+    dataset_path_test = "adult/test_data.csv"
+    dataset_jspec = "adult/jspec.json"
+    preprocess_src_path = "preprocess.dml"
+
+    sds = SystemDSContext()
+
+    SCHEMA = '"DOUBLE,STRING,DOUBLE,STRING,DOUBLE,STRING,STRING,STRING,STRING,STRING,DOUBLE,DOUBLE,DOUBLE,STRING,STRING"'
+
+    F1 = sds.read(dataset_path_train, schema=SCHEMA)
+    F2 = sds.read(dataset_path_test,  schema=SCHEMA)
+
+    jspec = sds.read(dataset_jspec, data_type="scalar", value_type="string")
+    PREPROCESS_package = sds.source(preprocess_src_path, "preprocess", print_imported_methods=True)
+
+    X1 = F1.rbind(F2)
+    X1, M1 = X1.transform_encode(spec=jspec)
+
+    X = PREPROCESS_package.get_X(X1, 1, train_count)
+    Y = PREPROCESS_package.get_Y(X1, 1, train_count)
+
+    Xt = PREPROCESS_package.get_X(X1, train_count, train_count+test_count)
+    Yt = PREPROCESS_package.get_Y(X1, train_count, train_count+test_count)
+
+    Yt = PREPROCESS_package.replace_value(Yt, 3.0, 1.0)
+    Yt = PREPROCESS_package.replace_value(Yt, 4.0, 2.0)
+
+    betas = multiLogReg(X, Y)
+
+    [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt)
+
+    confusion_matrix_abs, _ = confusionMatrix(y_pred, Yt).compute()
+    print(confusion_matrix_abs)
+
+Level 3
+-------
+
+In this level we want to show how we can integrate a custom-built algorithm using the Python API.
+For this we will introduce another dml file, which can be used to train a basic feed-forward network.
+
+Step 1: Obtain data
+~~~~~~~~~~~~~~~~~~~
+
+For the whole data setup, please refer to level 2, steps 1 to 3, as these steps are identical.
+
+.. code-block:: python
+
+    import numpy as np
+    from systemds.context import SystemDSContext
+
+    train_count = 32561
+    test_count = 16281
+
+    dataset_path_train = "adult/train_data.csv"
+    dataset_path_test = "adult/test_data.csv"
+    dataset_jspec = "adult/jspec.json"
+    preprocess_src_path = "preprocess.dml"
+
+    sds = SystemDSContext()
+
+    SCHEMA = '"DOUBLE,STRING,DOUBLE,STRING,DOUBLE,STRING,STRING,STRING,STRING,STRING,DOUBLE,DOUBLE,DOUBLE,STRING,STRING"'
+
+    F1 = sds.read(dataset_path_train, schema=SCHEMA)
+    F2 = sds.read(dataset_path_test,  schema=SCHEMA)
+
+    jspec = sds.read(dataset_jspec, data_type="scalar", value_type="string")
+    PREPROCESS_package = sds.source(preprocess_src_path, "preprocess", print_imported_methods=True)
+
+    X1 = F1.rbind(F2)
+    X1, M1 = X1.transform_encode(spec=jspec)
+
+    X = PREPROCESS_package.get_X(X1, 1, train_count)
+    Y = PREPROCESS_package.get_Y(X1, 1, train_count)
+
+    Xt = PREPROCESS_package.get_X(X1, train_count, train_count+test_count)
+    Yt = PREPROCESS_package.get_Y(X1, train_count, train_count+test_count)
+
+    Yt = PREPROCESS_package.replace_value(Yt, 3.0, 1.0)
+    Yt = PREPROCESS_package.replace_value(Yt, 4.0, 2.0)
+
+Step 2: Load the algorithm
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We use a neural network with 2 hidden layers, each consisting of 200 neurons.
+First, we need to source the dml file for neural networks.
+This file includes all the necessary functions for training, evaluating, and storing the model.
+The object returned by the source call is then used for calling these functions.
+The file can be found here:
+
+    - :doc:`tests/examples/tutorials/neural_net_source.dml`
+
+.. code-block:: python
+
+    FFN_package = sds.source(neural_net_src_path, "fnn", print_imported_methods=True)
+
+Step 3: Training the neural network
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Training a neural network in SystemDS using the train function is straightforward.
+The first two arguments are the training features and the target values we want to fit our model on.
+Then we need to set the hyperparameters of the model.
+We choose to train for 1 epoch with a batch size of 16 and a learning rate of 0.01, which are common parameters for neural networks.
+The seed argument ensures that running the code again yields the same results.
+
+.. code-block:: python
+
+    epochs = 1
+    batch_size = 16
+    learning_rate = 0.01
+    seed = 42
+
+    network = FFN_package.train(X, Y, epochs, batch_size, learning_rate, seed)
+
+Step 4: Saving the model
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+For later usage, we can save the trained model.
+We only need to specify the name of our model and the file path.
+This call stores the weights and biases of our model.
+
+.. code-block:: python
+
+    FFN_package.save_model(network, '"model/python_FFN/"').compute(verbose=True)
+
+Full Script
+~~~~~~~~~~~
+
+The complete script now can be seen here:
+
+.. code-block:: python
+
+    import numpy as np
+    from systemds.context import SystemDSContext
+
+    train_count = 32561
+    test_count = 16281
+
+    dataset_path_train = "adult/train_data.csv"
+    dataset_path_test = "adult/test_data.csv"
+    dataset_jspec = "adult/jspec.json"
+    preprocess_src_path = "preprocess.dml"
+    neural_net_src_path = "neural_net_source.dml"
+
+    sds = SystemDSContext()
+
+    SCHEMA = '"DOUBLE,STRING,DOUBLE,STRING,DOUBLE,STRING,STRING,STRING,STRING,STRING,DOUBLE,DOUBLE,DOUBLE,STRING,STRING"'
+
+    F1 = sds.read(dataset_path_train, schema=SCHEMA)
+    F2 = sds.read(dataset_path_test,  schema=SCHEMA)
+
+    jspec = sds.read(dataset_jspec, data_type="scalar", value_type="string")
+    PREPROCESS_package = sds.source(preprocess_src_path, "preprocess", print_imported_methods=True)
+
+    X1 = F1.rbind(F2)
+    X1, M1 = X1.transform_encode(spec=jspec)
+
+    X = PREPROCESS_package.get_X(X1, 1, train_count)
+    Y = PREPROCESS_package.get_Y(X1, 1, train_count)
+
+    Xt = PREPROCESS_package.get_X(X1, train_count, train_count+test_count)
+    Yt = PREPROCESS_package.get_Y(X1, train_count, train_count+test_count)
+
+    Yt = PREPROCESS_package.replace_value(Yt, 3.0, 1.0)
+    Yt = PREPROCESS_package.replace_value(Yt, 4.0, 2.0)
+
+    FFN_package = sds.source(neural_net_src_path, "fnn", print_imported_methods=True)
+
+    epochs = 1
+    batch_size = 16
+    learning_rate = 0.01
+    seed = 42
+
+    network = FFN_package.train(X, Y, epochs, batch_size, learning_rate, seed)
+
+    FFN_package.save_model(network, '"model/python_FFN/"').compute(verbose=True)

Review comment:
       We can now save lists directly; therefore, simply call write, and when reading the model in again later, simply call read.
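   
   For example, something along these lines (an untested sketch; the exact write/read signatures may differ):
   
   network.write("model/python_FFN/").compute()
   # and later, to load the model again:
   model = sds.read("model/python_FFN/")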

##########
File path: src/main/python/docs/source/guide/python_end_to_end_tut.rst
##########
@@ -0,0 +1,561 @@
+.. -------------------------------------------------------------
+.. 
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+.. 
+..   http://www.apache.org/licenses/LICENSE-2.0
+.. 
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+.. 
+.. ------------------------------------------------------------
+
+Python end-to-end tutorial
+==========================
+
+The goal of this tutorial is to showcase different features of the SystemDS framework that can be accessed with the Python API.
+For this, we want to use the `Adult <https://archive.ics.uci.edu/ml/datasets/adult/>`_ dataset and predict whether the income of a person exceeds $50K/yr based on census data.
+The Adult dataset contains attributes like age, workclass, education, marital-status, occupation, race, [...] and the labels >50K or <=50K.
+Most of these features are categorical string values, but the dataset also includes continuous features.
+To this end, we define three different levels with increasing detail with regard to the features provided by SystemDS.
+In the first level, we simply get an already preprocessed dataset from a DatasetManager.
+The second level shows the built-in preprocessing capabilities of SystemDS.
+With the third level, we want to show how we can integrate custom-built networks or algorithms into our Python program.
+
+Prerequisite: 
+
+- :doc:`/getting_started/install`
+
+Level 1
+-------
+
+This example shows how one can work with NumPy data within the SystemDS framework. More precisely, we will make use of the
+built-in DataManager, Multinomial Logistic Regression function, and the Confusion Matrix function. The dataset used in this
+tutorial is a preprocessed version of the "UCI Adult Data Set". If you are interested in data preprocessing, take a look at level 2.
+If one wants to skip the explanation then the full script is available at the end of this level.
+
+We will train a Multinomial Logistic Regression model on the training dataset and subsequently we will use the test dataset
+to assess how well our model can predict if the income is above or below $50K/yr based on the features.
+
+Step 1: Load and prepare data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+First, we get our training and testing data from the built-in DataManager. Since the multiLogReg function requires the
+labels (Y) to be > 0, we add 1 to all labels. This ensures that the smallest label is >= 1. Additionally, we will only use
+a fraction of the training and test sets to speed up the execution.
+
+.. code-block:: python
+
+    from systemds.context import SystemDSContext
+    from systemds.examples.tutorials.adult import DataManager
+
+    sds = SystemDSContext()
+    d = DataManager()
+
+    # limit the sample size
+    train_count = 15000
+    test_count = 5000
+
+    train_data, train_labels, test_data, test_labels = d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+    # Train data
+    X = sds.from_numpy(train_data[:train_count])
+    Y = sds.from_numpy(train_labels[:train_count])
+    Y = Y + 1.0
+
+    # Test data
+    Xt = sds.from_numpy(test_data[:test_count])
+    Yt = sds.from_numpy(test_labels[:test_count])
+    Yt = Yt + 1.0
+
+Here the DataManager contains the code for downloading and setting up NumPy arrays containing the data.
+It is noteworthy that the function get_preprocessed_dataset has options for basic standardization, interpolation, and for combining, inside one column, the categorical values whose occurrence counts fall below a certain threshold.
+
+Step 2: Training
+~~~~~~~~~~~~~~~~
+
+Now that we prepared the data, we can use the multiLogReg function. First, we will train the model on our
+training data. Afterward, we can make predictions on the test data and assess the performance of the model.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import multiLogReg
+    betas = multiLogReg(X, Y)
+
+Note that nothing has been calculated yet. In SystemDS the calculation is executed once compute() is called.
+E.g. betas_res = betas.compute().
+
+We can now use the trained model to make predictions on the test data.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import multiLogRegPredict
+    [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt)
+
+The multiLogRegPredict function has three return values:
+    - m, a matrix with the mean probability of correctly classifying each label. We do not use it further in this example.
+    - y_pred, the predictions made using the model
+    - acc, the accuracy achieved by the model.
+
+Step 3: Confusion Matrix
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+A confusion matrix is a useful tool to analyze the performance of the model and to obtain a better understanding
+of which classes the model has difficulty separating.
+The confusionMatrix function takes the predicted labels and the true labels. It then returns the confusion matrix
+for the predictions and the confusion matrix averages of each true class.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import confusionMatrix
+    confusion_matrix_abs, _ = confusionMatrix(y_pred, Yt).compute()
+    print(confusion_matrix_abs)
+
+Full Script
+~~~~~~~~~~~
+
+In the full script, some steps are combined to keep the overall script short.
+
+.. code-block:: python
+
+    import numpy as np
+    from systemds.context import SystemDSContext
+    from systemds.examples.tutorials.adult import DataManager
+    from systemds.operator.algorithm import multiLogReg, multiLogRegPredict, confusionMatrix
+
+    sds = SystemDSContext()
+    d = DataManager()
+
+    # limit the sample size
+    train_count = 15000
+    test_count = 5000
+
+    train_data, train_labels, test_data, test_labels = d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+    # Train data
+    X = sds.from_numpy(train_data[:train_count])
+    Y = sds.from_numpy(train_labels[:train_count])
+    Y = Y + 1.0
+
+    # Test data
+    Xt = sds.from_numpy(test_data[:test_count])
+    Yt = sds.from_numpy(test_labels[:test_count])
+    Yt = Yt + 1.0
+
+    betas = multiLogReg(X, Y)
+    [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt)
+
+    confusion_matrix_abs, _ = confusionMatrix(y_pred, Yt).compute()
+    print(confusion_matrix_abs)
+
+Level 2
+-------
+
+This part of the tutorial shows an overview of the preprocessing capabilities that SystemDS has to offer.
+We will take an unprocessed dataset in csv format, read it with SystemDS, and then do the heavy lifting of the preprocessing with SystemDS.
+As mentioned before, we want to use the Adult dataset for this task.
+If one wants to skip the explanation, then the full script is available at the end of this level.
+
+Step 1: Metadata and reading
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+First of all, we need to download the dataset and create an mtd file for specifying different metadata about the dataset.
+We download the train and test dataset from: https://archive.ics.uci.edu/ml/datasets/adult
+
+The downloaded dataset will be slightly modified for convenience. These modifications entail removing unnecessary newlines at the end of the files and
+adding column names at the top of the files such that the first line looks like:
+
+.. code-block::
+
+    age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
+
+We also delete the line holding the string value |1x3 Cross validator inside the test dataset.
+
+After these modifications, we have to define an mtd file for each file we want to read. This mtd file has to be in the same directory as the dataset.
+In this particular example, the dataset is split into two files "train_data.csv" and "test_data.csv". We want to read both, which means that we will define an mtd file for
+each of them. Those files will be called "train_data.csv.mtd" and "test_data.csv.mtd".
+In these files, we can define certain properties of the file and also specify which values should be treated as missing values.
+
+The content of the train_data.csv.mtd file is:
+
+.. code-block::
+
+    {
+    "data_type": "frame",
+    "format": "csv",
+    "header": true,
+    "naStrings": [ "?", "" ],
+    "rows": 32561,
+    "cols": 15
+    }
+
+The "format" of the file is csv, and "header" is set to true because we added the feature names as headers to the csv files.
+The value "data_type" is set to frame, as the preprocessing functions that we use require this datatype.
+The value of "naStrings" is a list of all the string values that should be treated as unknown values during the preprocessing.
+Also, "rows" in our example is set to 32561, as we have this many entries and "cols" is set to 15 as we have 14 features, and one label column inside the files. We will later show how we can split them.
+
+After these requirements are completed, we have to define a SystemDSContext for reading our dataset. We can do this in the following way:
+
+.. code-block:: python
+
+    sds = SystemDSContext()
+
+    train_count = 32561
+    test_count = 16281
+
+With this context we can now define a read operation using the path of the dataset and a schema.
+The schema simply defines the data types for each column.
+
+As already mentioned, SystemDS supports lazy execution by default, which means that the read operation is only executed after calling the compute() function.
+
+.. code-block:: python
+
+    SCHEMA = '"DOUBLE,STRING,DOUBLE,STRING,DOUBLE,STRING,STRING,STRING,STRING,STRING,DOUBLE,DOUBLE,DOUBLE,STRING,STRING"'
+
+    dataset_path_train = "adult/train_data.csv"
+    dataset_path_test = "adult/test_data.csv"
+
+    F1 = sds.read(
+        dataset_path_train,
+        schema=SCHEMA
+    )
+    F2 = sds.read(
+        dataset_path_test,
+        schema=SCHEMA
+    )
+
+Step 2: Defining preprocess operations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Now that the read operation has been declared, we can define an additional file for the further preprocessing of the dataset.
+For this, we create a .json file that holds information about the operations that will be performed on individual columns.
+For the sake of this tutorial we will use the file "jspec.json" with the following content:
+
+.. code-block::
+
+    {
+    "impute":
+    [ { "name": "age", "method": "global_mean" }
+     ,{ "name": "workclass" , "method": "global_mode" }
+     ,{ "name": "fnlwgt", "method": "global_mean" }
+     ,{ "name": "education", "method": "global_mode"  }
+     ,{ "name": "education-num", "method": "global_mean" }
+     ,{ "name": "marital-status"      , "method": "global_mode" }
+     ,{ "name": "occupation"        , "method": "global_mode" }
+     ,{ "name": "relationship" , "method": "global_mode" }
+     ,{ "name": "race"        , "method": "global_mode" }
+     ,{ "name": "sex"        , "method": "global_mode" }
+     ,{ "name": "capital-gain", "method": "global_mean" }
+     ,{ "name": "capital-loss", "method": "global_mean" }
+     ,{ "name": "hours-per-week", "method": "global_mean" }
+     ,{ "name": "native-country"        , "method": "global_mode" }
+    ],
+    "bin": [ { "name": "age"  , "method": "equi-width", "numbins": 3 }],
+    "dummycode": ["age", "workclass", "education", "marital-status", "occupation", "relationship", "race", "sex", "native-country"],
+    "recode": ["income"]
+    }
+
+Our dataset has missing values. An easy way to deal with this circumstance is to use the "impute" option that SystemDS supports.
+We simply pass a list that maps column names to the imputation method. A more specific example is the "education" column.
+In the dataset, certain entries have missing values for this column. As this is a string feature,
+we can simply define the method as "global_mode" and replace every missing value with the global mode of this column. It is important to note that
+we first had to define the missing-value strings of our selected dataset in the .mtd files ("naStrings": [ "?", "" ]).
+
+With the "bin" keyword we can discretize continuous values into a small number of bins. Here the column with age values
+is discretized into three age intervals. The only method that is currently supported is equi-width binning.
+
+The column-level data transformation "dummycode" allows us to one-hot-encode a column.
+In our example we first bin the "age" column into 3 different bins. This means that we now have one column where one entry can belong to one of 3 age groups. After using
+"dummycode", we transform this one column into 3 different columns, one for each bin.
+
+Finally, we make use of the "recode" transformation for categorical columns; it maps all distinct categories in
+the column to consecutive numbers, starting from 1. In our example we recode the "income" column, which
+transforms it from "<=50K" and ">50K" to "1" and "2" respectively.
+
+Another good resource for further ways of processing is: https://apache.github.io/systemds/site/dml-language-reference.html
+
+There we provide different examples for defining jspecs and an overview of the functionality that is currently supported.
+
+After defining the .jspec file we can read it by passing the filepath, data_type and value_type using the following command:
+
+.. code-block:: python
+
+    dataset_jspec = "adult/jspec.json"
+    jspec = sds.read(dataset_jspec, data_type="scalar", value_type="string")
+
+Finally, we need to define a custom dml file to split the features from the labels and replace certain values, which we will need later.
+We will call this file "preprocess.dml":
+
+.. code-block::
+
+    get_X = function(matrix[double] X, int start, int stop)
+    return (matrix[double] returnVal) {
+    returnVal = X[start:stop,1:ncol(X)-1]
+    }
+    get_Y = function(matrix[double] X, int start, int stop)
+    return (matrix[double] returnVal) {
+    returnVal = X[start:stop,ncol(X):ncol(X)]
+    }
+    replace_value = function(matrix[double] X, double pattern , double replacement)
+    return (matrix[double] returnVal) {
+    returnVal = replace(target=X, pattern=pattern, replacement=replacement)

Review comment:
       indent

##########
File path: src/main/python/docs/source/guide/python_end_to_end_tut.rst
##########
@@ -0,0 +1,561 @@
+.. -------------------------------------------------------------
+.. 
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+.. 
+..   http://www.apache.org/licenses/LICENSE-2.0
+.. 
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+.. 
+.. ------------------------------------------------------------
+
+Python end-to-end tutorial
+==========================
+
+The goal of this tutorial is to showcase different features of the SystemDS framework that can be accessed with the Python API.
+For this, we want to use the `Adult <https://archive.ics.uci.edu/ml/datasets/adult/>`_ dataset and predict whether the income of a person exceeds $50K/yr based on census data.
+The Adult dataset contains attributes like age, workclass, education, marital-status, occupation, race, [...] and the labels >50K or <=50K.
+Most of these features are categorical string values, but the dataset also includes continuous features.
+To this end, we define three different levels with increasing detail with regard to the features provided by SystemDS.
+In the first level, we simply get an already preprocessed dataset from a DatasetManager.
+The second level shows the built-in preprocessing capabilities of SystemDS.
+With the third level, we want to show how we can integrate custom-built networks or algorithms into our Python program.
+
+Prerequisite: 
+
+- :doc:`/getting_started/install`
+
+Level 1
+-------
+
+This example shows how one can work with NumPy data within the SystemDS framework. More precisely, we will make use of the
+built-in DataManager, Multinomial Logistic Regression function, and the Confusion Matrix function. The dataset used in this
+tutorial is a preprocessed version of the "UCI Adult Data Set". If you are interested in data preprocessing, take a look at level 2.
+If one wants to skip the explanation then the full script is available at the end of this level.
+
+We will train a Multinomial Logistic Regression model on the training dataset and subsequently we will use the test dataset
+to assess how well our model can predict if the income is above or below $50K/yr based on the features.
+
+Step 1: Load and prepare data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+First, we get our training and testing data from the built-in DataManager. Since the multiLogReg function requires the
+labels (Y) to be > 0, we add 1 to all labels. This ensures that the smallest label is >= 1. Additionally, we will only use
+a fraction of the training and test sets to speed up the execution.
+
+.. code-block:: python
+
+    from systemds.context import SystemDSContext
+    from systemds.examples.tutorials.adult import DataManager
+
+    sds = SystemDSContext()
+    d = DataManager()
+
+    # limit the sample size
+    train_count = 15000
+    test_count = 5000
+
+    train_data, train_labels, test_data, test_labels = d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+    # Train data
+    X = sds.from_numpy(train_data[:train_count])
+    Y = sds.from_numpy(train_labels[:train_count])
+    Y = Y + 1.0
+
+    # Test data
+    Xt = sds.from_numpy(test_data[:test_count])
+    Yt = sds.from_numpy(test_labels[:test_count])
+    Yt = Yt + 1.0
+
+Here the DataManager contains the code for downloading and setting up NumPy arrays containing the data.
+It is noteworthy that the function get_preprocessed_dataset has options for basic standardization, interpolation, and for combining, inside one column, the categorical values whose occurrence counts fall below a certain threshold.
+
+Step 2: Training
+~~~~~~~~~~~~~~~~
+
+Now that we prepared the data, we can use the multiLogReg function. First, we will train the model on our
+training data. Afterward, we can make predictions on the test data and assess the performance of the model.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import multiLogReg
+    betas = multiLogReg(X, Y)
+
+Note that nothing has been calculated yet. In SystemDS the calculation is executed once compute() is called.
+E.g. betas_res = betas.compute().
+
+We can now use the trained model to make predictions on the test data.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import multiLogRegPredict
+    [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt)
+
+The multiLogRegPredict function has three return values:
+    - m, a matrix with the mean probability of correctly classifying each label. We do not use it further in this example.
+    - y_pred, the predictions made using the model
+    - acc, the accuracy achieved by the model.
+
+Step 3: Confusion Matrix
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+A confusion matrix is a useful tool to analyze the performance of the model and to obtain a better understanding
+of which classes the model has difficulty separating.
+The confusionMatrix function takes the predicted labels and the true labels. It then returns the confusion matrix
+for the predictions and the confusion matrix averages of each true class.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import confusionMatrix
+    confusion_matrix_abs, _ = confusionMatrix(y_pred, Yt).compute()
+    print(confusion_matrix_abs)
+
+Full Script
+~~~~~~~~~~~
+
+In the full script, some steps are combined to keep the overall script short.
+
+.. code-block:: python
+
+    import numpy as np
+    from systemds.context import SystemDSContext
+    from systemds.examples.tutorials.adult import DataManager
+    from systemds.operator.algorithm import multiLogReg, multiLogRegPredict, confusionMatrix
+
+    sds = SystemDSContext()
+    d = DataManager()
+
+    # limit the sample size
+    train_count = 15000
+    test_count = 5000
+
+    train_data, train_labels, test_data, test_labels = d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+    # Train data
+    X = sds.from_numpy(train_data[:train_count])
+    Y = sds.from_numpy(train_labels[:train_count])
+    Y = Y + 1.0
+
+    # Test data
+    Xt = sds.from_numpy(test_data[:test_count])
+    Yt = sds.from_numpy(test_labels[:test_count])
+    Yt = Yt + 1.0
+
+    betas = multiLogReg(X, Y)
+    [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt)
+
+    confusion_matrix_abs, _ = confusionMatrix(y_pred, Yt).compute()
+    print(confusion_matrix_abs)
+
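+When the script is finished, the SystemDS context can be shut down to free its resources; a minimal sketch, assuming sds is not used as a context manager:
+
+.. code-block:: python
+
+    # stop the SystemDS context (and the JVM it started) once all results are computed
+    sds.close()
+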
+Level 2
+-------
+
+This part of the tutorial gives an overview of the preprocessing capabilities that SystemDS has to offer.
+We will take an unprocessed dataset in csv format, read it with SystemDS, and let SystemDS do the heavy lifting of the preprocessing.
+As mentioned before, we want to use the Adult dataset for this task.
+If one wants to skip the explanation, then the full script is available at the end of this level.
+
+Step 1: Metadata and reading
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+First of all, we need to download the dataset and create an mtd file specifying metadata about the dataset.
+We download the train and test dataset from: https://archive.ics.uci.edu/ml/datasets/adult
+
+The downloaded dataset will be slightly modified for convenience. These modifications entail removing unnecessary newlines at the end of the files and
+adding column names at the top of the files such that the first line looks like:
+
+.. code-block::
+
+    age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
+
+We also delete the line holding the string value "|1x3 Cross validator" inside the test dataset.
+
+After these modifications, we have to define an mtd file for each file we want to read. This mtd file has to be in the same directory as the dataset.
+In this particular example, the dataset is split into two files "train_data.csv" and "test_data.csv". We want to read both, which means that we will define an mtd file for
+each of them. Those files will be called "train_data.csv.mtd" and "test_data.csv.mtd".
+In these files, we can define certain properties of the file and also specify which values are supposed to be treated as missing values.
+
+The content of the train_data.csv.mtd file is:
+
+.. code-block::
+
+    {
+    "data_type": "frame",
+    "format": "csv",
+    "header": true,
+    "naStrings": [ "?", "" ],
+    "rows": 32561,
+    "cols": 15
+    }
+
+The "format" of the file is csv, and "header" is set to true because we added the feature names as headers to the csv files.
+The value "data_type" is set to frame, as the preprocessing functions that we use require this datatype.
+The value of "naStrings" is a list of all the string values that should be treated as unknown values during the preprocessing.
+Also, "rows" in our example is set to 32561, as we have this many entries and "cols" is set to 15 as we have 14 features, and one label column inside the files. We will later show how we can split them.
+
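+The test_data.csv.mtd file looks the same except for the row count, since the test file holds 16281 entries:
+
+.. code-block::
+
+    {
+    "data_type": "frame",
+    "format": "csv",
+    "header": true,
+    "naStrings": [ "?", "" ],
+    "rows": 16281,
+    "cols": 15
+    }
+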
+After these requirements are completed, we have to define a SystemDSContext for reading our dataset. We can do this in the following way:
+
+.. code-block:: python
+
+    sds = SystemDSContext()
+
+    train_count = 32561
+    test_count = 16281
+
+With this context we can now define a read operation using the path of the dataset and a schema.
+The schema simply defines the data types for each column.
+
+As already mentioned, SystemDS supports lazy execution by default, which means that the read operation is only executed after calling the compute() function.
+
+.. code-block:: python
+
+    SCHEMA = '"DOUBLE,STRING,DOUBLE,STRING,DOUBLE,STRING,STRING,STRING,STRING,STRING,DOUBLE,DOUBLE,DOUBLE,STRING,STRING"'
+
+    dataset_path_train = "adult/train_data.csv"
+    dataset_path_test = "adult/test_data.csv"
+
+    F1 = sds.read(
+        dataset_path_train,
+        schema=SCHEMA
+    )
+    F2 = sds.read(
+        dataset_path_test,
+        schema=SCHEMA
+    )
+
+Step 2: Defining preprocess operations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Now that the read operation has been declared, we can define an additional file for the further preprocessing of the dataset.
+For this, we create a .json file that holds information about the operations that will be performed on individual columns.
+For the sake of this tutorial we will use the file "jspec.json" with the following content:
+
+.. code-block::
+
+    {
+    "impute":
+    [ { "name": "age", "method": "global_mean" }
+     ,{ "name": "workclass" , "method": "global_mode" }
+     ,{ "name": "fnlwgt", "method": "global_mean" }
+     ,{ "name": "education", "method": "global_mode"  }
+     ,{ "name": "education-num", "method": "global_mean" }
+     ,{ "name": "marital-status"      , "method": "global_mode" }
+     ,{ "name": "occupation"        , "method": "global_mode" }
+     ,{ "name": "relationship" , "method": "global_mode" }
+     ,{ "name": "race"        , "method": "global_mode" }
+     ,{ "name": "sex"        , "method": "global_mode" }
+     ,{ "name": "capital-gain", "method": "global_mean" }
+     ,{ "name": "capital-loss", "method": "global_mean" }
+     ,{ "name": "hours-per-week", "method": "global_mean" }
+     ,{ "name": "native-country"        , "method": "global_mode" }
+    ],
+    "bin": [ { "name": "age"  , "method": "equi-width", "numbins": 3 }],
+    "dummycode": ["age", "workclass", "education", "marital-status", "occupation", "relationship", "race", "sex", "native-country"],
+    "recode": ["income"]
+    }
+
+Our dataset has missing values. An easy way to deal with this is to use the "impute" option that SystemDS supports.
+We simply pass a list that maps column names to the imputation method. A more specific example is the "education" column.
+In the dataset, certain entries have missing values for this column. As this is a string feature,
+we can simply define the method as "global_mode" and replace every missing value with the global mode of this column. It is important to note that
+we first had to declare which strings count as missing values using the .mtd files ("naStrings": [ "?", "" ]).
+
+With the "bin" keyword we can discretize continuous values into a small number of bins. Here the column with age values
+is discretized into three age intervals. The only method that is currently supported is equi-width binning.
+
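+As a rough illustration of equi-width binning (assuming the ages in the raw data range from 17 to 90), each bin covers an equally wide interval:
+
+.. code-block::
+
+    bin width = (90 - 17) / 3 ≈ 24.3
+    bin 1: [17.0, 41.3],  bin 2: (41.3, 65.7],  bin 3: (65.7, 90.0]
+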
+The column-level data transformation "dummycode" allows us to one-hot-encode a column.
+In our example we first bin the "age" column into 3 different bins. This means that we now have one column where one entry can belong to one of 3 age groups. After using
+"dummycode", we transform this one column into 3 different columns, one for each bin.
+
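+For a single row this means that the binned age value is replaced by three indicator columns, of which exactly one is set (the column names here are only illustrative):
+
+.. code-block::
+
+    age bin 2   ->   age_1  age_2  age_3
+                     0      1      0
+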
+Finally, we make use of the "recode" transformation for categorical columns; it maps all distinct categories in
+the column to consecutive numbers, starting from 1. In our example we recode the "income" column, which
+transforms its values "<=50K" and ">50K" to "1" and "2" respectively.
+
+Another good resource on further preprocessing options is the DML language reference: https://apache.github.io/systemds/site/dml-language-reference.html
+
+There we provide different examples for defining jspecs and list the functionality that is currently supported.
+
+After defining the .jspec file we can read it by passing the filepath, data_type and value_type using the following command:
+
+.. code-block:: python
+
+    dataset_jspec = "adult/jspec.json"
+    jspec = sds.read(dataset_jspec, data_type="scalar", value_type="string")
+
+Finally, we need to define a custom dml file to split the features from the labels and replace certain values, which we will need later.
+We will call this file "preprocess.dml":
+
+.. code-block::
+
+    get_X = function(matrix[double] X, int start, int stop)
+    return (matrix[double] returnVal) {
+    returnVal = X[start:stop,1:ncol(X)-1]
+    }
+    get_Y = function(matrix[double] X, int start, int stop)
+    return (matrix[double] returnVal) {
+    returnVal = X[start:stop,ncol(X):ncol(X)]

Review comment:
       ah i see, you made this function to enable slicing columns out.
   
    I will add a slice function later, then remove the need of using this.
   

##########
File path: src/main/python/docs/source/guide/python_end_to_end_tut.rst
##########
@@ -0,0 +1,561 @@
+Finally, we need to define a custom dml file to split the features from the labels and replace certain values, which we will need later.
+We will call this file "preprocess.dml":
+
+.. code-block::
+
+    get_X = function(matrix[double] X, int start, int stop)
+    return (matrix[double] returnVal) {
+    returnVal = X[start:stop,1:ncol(X)-1]
+    }
+    get_Y = function(matrix[double] X, int start, int stop)
+    return (matrix[double] returnVal) {
+    returnVal = X[start:stop,ncol(X):ncol(X)]
+    }
+    replace_value = function(matrix[double] X, double pattern , double replacement)
+    return (matrix[double] returnVal) {
+    returnVal = replace(target=X, pattern=pattern, replacement=replacement)
+    }
+
+The get_X function simply extracts every column except the last one and can also be used to pick certain slices from the dataset.
+The get_Y function only extracts the last column, which in our case holds the labels. The replace_value function replaces one double value with another.
+The preprocess.dml file can be read with the following command:
+
+.. code-block:: python
+
+    preprocess_src_path = "preprocess.dml"
+    PREPROCESS_package = sds.source(preprocess_src_path, "preprocess", print_imported_methods=True)
+
+The print_imported_methods flag can be used to verify whether every method has been parsed correctly.
+
+Step 3: Applying the preprocessing steps
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Generally speaking, we would call the transform_encode function on the train dataset and then, with the returned encoding metadata, call the transform_apply function on the test dataset.
+In the case of the Adult dataset, we have inconsistent label names inside the test dataset and the train dataset, which is why we will show how we can deal with that using SystemDS.
+First of all, we combine the train and the test dataset by using the rbind() function. This function simply appends the Frame F2 at the end of Frame F1.
+This is necessary to ensure that the encoding is identical between train and test dataset.
+
+.. code-block:: python
+
+    X1 = F1.rbind(F2)
+
+In order to use our jspec file we can apply the transform_encode() function. We simply have to pass the read .json file from before.
+In our particular case we obtain the Matrix X1 and the Frame M1 from the operation. X1 holds all the encoded values and M1 holds a mapping between the encoded values
+and all the initial values. Columns that have not been specified in the .json file are not altered.
+
+.. code-block:: python
+
+    X1, M1 = X1.transform_encode(spec=jspec)
+
+We can now use the previously parsed dml file for splitting the dataset and unifying the inconsistent labels: the income labels in the test file carry a trailing
+period ("<=50K." and ">50K."), so transform_encode assigns them the additional codes 3 and 4, which we map back to 1 and 2 with replace_value.
+It is noteworthy that the file is parsed such that we can directly call the function names from the Python API.
+
+.. code-block:: python
+
+    X = PREPROCESS_package.get_X(X1, 1, train_count)
+    Y = PREPROCESS_package.get_Y(X1, 1, train_count)
+
+    Xt = PREPROCESS_package.get_X(X1, train_count, train_count+test_count)
+    Yt = PREPROCESS_package.get_Y(X1, train_count, train_count+test_count)
+
+    Yt = PREPROCESS_package.replace_value(Yt, 3.0, 1.0)
+    Yt = PREPROCESS_package.replace_value(Yt, 4.0, 2.0)
+
+Step 4: Training and confusion matrix
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Now that we prepared the data we can use the multiLogReg function.
+These steps are identical to steps 2 and 3 already described in level 1 of this tutorial.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import multiLogReg
+    from systemds.operator.algorithm import confusionMatrix
+    from systemds.operator.algorithm import multiLogRegPredict
+    betas = multiLogReg(X, Y)
+    [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt)
+    confusion_matrix_abs, _ = confusionMatrix(y_pred, Yt).compute()
+    print(confusion_matrix_abs)
+
+Full Script
+~~~~~~~~~~~
+
+The complete script can now be seen here:
+
+.. code-block:: python
+
+    import numpy as np
+    from systemds.context import SystemDSContext
+    from systemds.operator.algorithm import multiLogReg, multiLogRegPredict, confusionMatrix
+
+    train_count = 32561
+    test_count = 16281
+
+    dataset_path_train = "adult/train_data.csv"
+    dataset_path_test = "adult/test_data.csv"
+    dataset_jspec = "adult/jspec.json"
+    preprocess_src_path = "preprocess.dml"
+
+    sds = SystemDSContext()
+
+    SCHEMA = '"DOUBLE,STRING,DOUBLE,STRING,DOUBLE,STRING,STRING,STRING,STRING,STRING,DOUBLE,DOUBLE,DOUBLE,STRING,STRING"'
+
+    F1 = sds.read(dataset_path_train, schema=SCHEMA)
+    F2 = sds.read(dataset_path_test,  schema=SCHEMA)
+
+    jspec = sds.read(dataset_jspec, data_type="scalar", value_type="string")
+    PREPROCESS_package = sds.source(preprocess_src_path, "preprocess", print_imported_methods=True)
+
+    X1 = F1.rbind(F2)
+    X1, M1 = X1.transform_encode(spec=jspec)
+
+    X = PREPROCESS_package.get_X(X1, 1, train_count)
+    Y = PREPROCESS_package.get_Y(X1, 1, train_count)
+
+    Xt = PREPROCESS_package.get_X(X1, train_count, train_count+test_count)
+    Yt = PREPROCESS_package.get_Y(X1, train_count, train_count+test_count)
+
+    Yt = PREPROCESS_package.replace_value(Yt, 3.0, 1.0)
+    Yt = PREPROCESS_package.replace_value(Yt, 4.0, 2.0)
+
+    betas = multiLogReg(X, Y)
+
+    [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt)
+
+    confusion_matrix_abs, _ = confusionMatrix(y_pred, Yt).compute()
+    print(confusion_matrix_abs)
+
+Level 3
+-------
+
+In this level we want to show how we can integrate a custom-built algorithm using the Python API.
+For this we will introduce another dml file, which can be used to train a basic feed-forward network.
+
+Step 1: Obtain data
+~~~~~~~~~~~~~~~~~~~
+
+For the whole data setup please refer to level 2, Steps 1 to 3, as these steps are identical.
+
+.. code-block:: python
+
+    import numpy as np
+    from systemds.context import SystemDSContext
+
+    train_count = 32561
+    test_count = 16281
+
+    dataset_path_train = "adult/train_data.csv"
+    dataset_path_test = "adult/test_data.csv"
+    dataset_jspec = "adult/jspec.json"
+    preprocess_src_path = "preprocess.dml"
+
+    sds = SystemDSContext()
+
+    SCHEMA = '"DOUBLE,STRING,DOUBLE,STRING,DOUBLE,STRING,STRING,STRING,STRING,STRING,DOUBLE,DOUBLE,DOUBLE,STRING,STRING"'
+
+    F1 = sds.read(dataset_path_train, schema=SCHEMA)
+    F2 = sds.read(dataset_path_test,  schema=SCHEMA)
+
+    jspec = sds.read(dataset_jspec, data_type="scalar", value_type="string")
+    PREPROCESS_package = sds.source(preprocess_src_path, "preprocess", print_imported_methods=True)
+
+    X1 = F1.rbind(F2)
+    X1, M1 = X1.transform_encode(spec=jspec)
+
+    X = PREPROCESS_package.get_X(X1, 1, train_count)
+    Y = PREPROCESS_package.get_Y(X1, 1, train_count)
+
+    Xt = PREPROCESS_package.get_X(X1, train_count, train_count+test_count)
+    Yt = PREPROCESS_package.get_Y(X1, train_count, train_count+test_count)
+
+    Yt = PREPROCESS_package.replace_value(Yt, 3.0, 1.0)
+    Yt = PREPROCESS_package.replace_value(Yt, 4.0, 2.0)
+
+Step 2: Load the algorithm
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We use a neural network with 2 hidden layers, each consisting of 200 neurons.
+First, we need to source the dml file for neural networks.
+This file includes all the necessary functions for training, evaluating, and storing the model.
+The returned object of the source call is further used for calling the functions.
+The file can be found here:
+
+    - :doc:`tests/examples/tutorials/neural_net_source.dml`
+
+.. code-block:: python
+
+    FFN_package = sds.source(neural_net_src_path, "fnn", print_imported_methods=True)
+
+Step 3: Training the neural network
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Training a neural network in SystemDS using the train function is straightforward.
+The first two arguments are the training features and the target values we want to fit our model on.
+Then we need to set the hyperparameters of the model.
+We choose to train for 1 epoch with a batch size of 16 and a learning rate of 0.01, which are common parameters for neural networks.
+The seed argument ensures that running the code again yields the same results.
+
+.. code-block:: python
+
+    epochs = 1
+    batch_size = 16
+    learning_rate = 0.01
+    seed = 42

Review comment:
       42 nice

##########
File path: src/main/python/docs/source/guide/python_end_to_end_tut.rst
##########
@@ -0,0 +1,561 @@
+    X = PREPROCESS_package.get_X(X1, 1, train_count)
+    Y = PREPROCESS_package.get_Y(X1, 1, train_count)
+
+    Xt = PREPROCESS_package.get_X(X1, train_count, train_count+test_count)
+    Yt = PREPROCESS_package.get_Y(X1, train_count, train_count+test_count)

Review comment:
       It would be nicer if we used the split function to construct the train and test datasets.

##########
File path: src/main/python/docs/source/guide/python_end_to_end_tut.rst
##########
@@ -0,0 +1,561 @@
+.. -------------------------------------------------------------
+.. 
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+.. 
+..   http://www.apache.org/licenses/LICENSE-2.0
+.. 
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+.. 
+.. ------------------------------------------------------------
+
+Python end-to-end tutorial
+==========================
+
+The goal of this tutorial is to showcase different features of the SystemDS framework that can be accessed with the Python API.
+For this, we want to use the `Adult <https://archive.ics.uci.edu/ml/datasets/adult/>`_ dataset and predict whether the income of a person exceeds $50K/yr based on census data.
+The Adult dataset contains attributes like age, workclass, education, marital-status, occupation, race, [...] and the labels >50K or <=50K.
+Most of these features are categorical string values, but the dataset also includes continuous features.
+For this, we define three different levels with an increasing level of detail with regard to features provided by SystemDS.
+In the first level, we simply get an already preprocessed dataset from a DatasetManager.
+The second level, shows the built-in preprocessing capabilities of SystemDS.
+With the third level, we want to show how we can integrate custom-built networks or algorithms into our Python program.
+
+Prerequisite: 
+
+- :doc:`/getting_started/install`
+
+Level 1
+-------
+
+This example shows how one can work with NumPy data within the SystemDS framework. More precisely, we will make use of the
+built-in DataManager, Multinomial Logistic Regression function, and the Confusion Matrix function. The dataset used in this
+tutorial is a preprocessed version of the "UCI Adult Data Set". If you are interested in data preprocessing, take a look at level 2.
+If one wants to skip the explanation then the full script is available at the end of this level.
+
+We will train a Multinomial Logistic Regression model on the training dataset and subsequently we will use the test dataset
+to assess how well our model can predict if the income is above or below $50K/yr based on the features.
+
+Step 1: Load and prepare data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+First, we get our training and testing data from the built-in DataManager. Since the multiLogReg function requires the
+labels (Y) to be > 0, we add 1 to all labels. This ensures that the smallest label is >= 1. Additionally we will only take
+a fraction of the training and test set into account to speed up the execution.
+
+.. code-block:: python
+
+    from systemds.context import SystemDSContext
+    from systemds.examples.tutorials.adult import DataManager
+
+    sds = SystemDSContext()
+    d = DataManager()
+
+    # limit the sample size
+    train_count = 15000
+    test_count = 5000
+
+    train_data, train_labels, test_data, test_labels = d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+    # Train data
+    X = sds.from_numpy(train_data[:train_count])
+    Y = sds.from_numpy(train_labels[:train_count])
+    Y = Y + 1.0
+
+    # Test data
+    Xt = sds.from_numpy(test_data[:test_count])
+    Yt = sds.from_numpy(test_labels[:test_count])
+    Yt = Yt + 1.0
+
+Here the DataManager contains the code for downloading and setting up NumPy arrays containing the data.
+It is noteworthy that the function get_preprocessed_dataset has options for basic standardization, interpolation, and combining categorical features inside one column whose occurrences are below a certain threshold.
+
+Step 2: Training
+~~~~~~~~~~~~~~~~
+
+Now that we prepared the data, we can use the multiLogReg function. First, we will train the model on our
+training data. Afterward, we can make predictions on the test data and assess the performance of the model.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import multiLogReg
+    betas = multiLogReg(X, Y)
+
+Note that nothing has been calculated yet. In SystemDS the calculation is executed once compute() is called.
+E.g. betas_res = betas.compute().
+
+We can now use the trained model to make predictions on the test data.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import multiLogRegPredict
+    [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt)
+
+The multiLogRegPredict function has three return values:
+    - m, a matrix with the mean probability of correctly classifying each label. We do not use it further in this example.
+    - y_pred, is the predictions made using the model
+    - acc, is the accuracy achieved by the model.
+
+Step 3: Confusion Matrix
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+A confusion matrix is a useful tool to analyze the performance of the model and to obtain a better understanding
+which classes the model has difficulties separating.
+The confusionMatrix function takes the predicted labels and the true labels. It then returns the confusion matrix
+for the predictions and the confusion matrix averages of each true class.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import confusionMatrix
+    confusion_matrix_abs, _ = confusionMatrix(y_pred, Yt).compute()
+    print(confusion_matrix_abs)
+
+Full Script
+~~~~~~~~~~~
+
+In the full script, some steps are combined to reduce the overall script.
+
+.. code-block:: python
+
+    import numpy as np
+    from systemds.context import SystemDSContext
+    from systemds.examples.tutorials.adult import DataManager
+    from systemds.operator.algorithm import multiLogReg, multiLogRegPredict, confusionMatrix
+
+    sds = SystemDSContext()
+    d = DataManager()
+
+    # limit the sample size
+    train_count = 15000
+    test_count = 5000
+
+    train_data, train_labels, test_data, test_labels = d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+    # Train data
+    X = sds.from_numpy(train_data[:train_count])
+    Y = sds.from_numpy(train_labels[:train_count])
+    Y = Y + 1.0
+
+    # Test data
+    Xt = sds.from_numpy(test_data[:test_count])
+    Yt = sds.from_numpy(test_labels[:test_count])
+    Yt = Yt + 1.0
+
+    betas = multiLogReg(X, Y)
+    [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt)
+
+    confusion_matrix_abs, _ = confusionMatrix(y_pred, Yt).compute()
+    print(confusion_matrix_abs)
+
+Level 2
+-------
+
+This part of the tutorial gives an overview of the preprocessing capabilities that SystemDS has to offer.
+We will read an unprocessed dataset in csv format with SystemDS and then let SystemDS do the heavy lifting of the preprocessing.
+As mentioned before, we use the Adult dataset for this task.
+If you want to skip the explanation, the full script is available at the end of this level.
+
+Step 1: Metadata and reading
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+First of all, we need to download the dataset and create an mtd file specifying metadata about it.
+We download the train and test datasets from: https://archive.ics.uci.edu/ml/datasets/adult
+
+The downloaded files are slightly modified for convenience. These modifications entail removing unnecessary newlines at the end of the files and
+adding column names at the top of the files so that the first line looks like:
+
+.. code-block::
+
+    age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
+
+We also delete the line holding the string value "|1x3 Cross validator" inside the test dataset.
+
+After these modifications, we have to define an mtd file for each file we want to read. The mtd file has to reside in the same directory as the dataset.
+In this particular example, the dataset is split into the two files "train_data.csv" and "test_data.csv". We want to read both, so we define an mtd file for
+each of them, called "train_data.csv.mtd" and "test_data.csv.mtd" respectively.
+In these files, we can define certain properties of the file and also specify which values should be treated as missing values.
+
+The content of the train_data.csv.mtd file is:
+
+.. code-block::
+
+    {
+    "data_type": "frame",
+    "format": "csv",
+    "header": true,
+    "naStrings": [ "?", "" ],
+    "rows": 32561,
+    "cols": 15
+    }
+
+The "format" of the file is csv, and "header" is set to true because we added the feature names as headers to the csv files.
+The value "data_type" is set to frame, as the preprocessing functions that we use require this datatype.
+The value of "naStrings" is a list of all the string values that should be treated as unknown values during the preprocessing.
+Also, "rows" in our example is set to 32561, as we have this many entries and "cols" is set to 15 as we have 14 features, and one label column inside the files. We will later show how we can split them.
+
+After these requirements are completed, we have to define a SystemDSContext for reading our dataset. We can do this in the following way:
+
+.. code-block:: python
+
+    sds = SystemDSContext()
+
+    train_count = 32561
+    test_count = 16281
+
+With this context we can now define a read operation using the path of the dataset and a schema.
+The schema simply defines the data types for each column.
+
+As already mentioned, SystemDS supports lazy execution by default, which means that the read operation is only executed after calling the compute() function.
+
+.. code-block:: python
+
+    SCHEMA = '"DOUBLE,STRING,DOUBLE,STRING,DOUBLE,STRING,STRING,STRING,STRING,STRING,DOUBLE,DOUBLE,DOUBLE,STRING,STRING"'
+
+    dataset_path_train = "adult/train_data.csv"
+    dataset_path_test = "adult/test_data.csv"
+
+    F1 = sds.read(
+        dataset_path_train,
+        schema=SCHEMA
+    )
+    F2 = sds.read(
+        dataset_path_test,
+        schema=SCHEMA
+    )
+
+Step 2: Defining preprocess operations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Now that the read operation has been declared, we can define an additional file for the further preprocessing of the dataset.
+For this, we create a .json file that holds information about the operations that will be performed on individual columns.
+For the sake of this tutorial we will use the file "jspec.json" with the following content:
+
+.. code-block::
+
+    {
+    "impute":
+    [ { "name": "age", "method": "global_mean" }
+     ,{ "name": "workclass" , "method": "global_mode" }
+     ,{ "name": "fnlwgt", "method": "global_mean" }
+     ,{ "name": "education", "method": "global_mode"  }
+     ,{ "name": "education-num", "method": "global_mean" }
+     ,{ "name": "marital-status"      , "method": "global_mode" }
+     ,{ "name": "occupation"        , "method": "global_mode" }
+     ,{ "name": "relationship" , "method": "global_mode" }
+     ,{ "name": "race"        , "method": "global_mode" }
+     ,{ "name": "sex"        , "method": "global_mode" }
+     ,{ "name": "capital-gain", "method": "global_mean" }
+     ,{ "name": "capital-loss", "method": "global_mean" }
+     ,{ "name": "hours-per-week", "method": "global_mean" }
+     ,{ "name": "native-country"        , "method": "global_mode" }
+    ],
+    "bin": [ { "name": "age"  , "method": "equi-width", "numbins": 3 }],
+    "dummycode": ["age", "workclass", "education", "marital-status", "occupation", "relationship", "race", "sex", "native-country"],
+    "recode": ["income"]
+    }
+
+Our dataset has missing values. An easy way to deal with this is the "impute" option that SystemDS supports.
+We simply pass a list that maps column names to the imputation method. A concrete example is the "education" column.
+In the dataset, certain entries have missing values for this column. As this is a string feature,
+we can simply set the method to "global_mode" and replace every missing value with the most frequent value of this column. It is important to note that
+we first had to declare which strings count as missing values using the .mtd files ("naStrings": [ "?", "" ]).
+
+With the "bin" keyword we can discretize continuous values into a small number of bins. Here the column with age values
+is discretized into three age intervals. The only method that is currently supported is equi-width binning.
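+As a hypothetical illustration of equi-width binning: if the observed ages ranged from 17 to 90, each of the three bins would span a width of (90 - 17) / 3 ≈ 24.3 years, i.e. roughly [17, 41.3), [41.3, 65.7), and [65.7, 90].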
+
+The column-level data transformation "dummycode" allows us to one-hot-encode a column.
+In our example we first bin the "age" column into 3 different bins. This means that we now have one column where one entry can belong to one of 3 age groups. After using
+"dummycode", we transform this one column into 3 different columns, one for each bin.
+
+Finally, we make use of the "recode" transformation for categorical columns; it maps all distinct categories in
+the column to consecutive numbers, starting from 1. In our example we recode the "income" column, which
+transforms its values "<=50K" and ">50K" to "1" and "2" respectively.
+
+Another good resource on further preprocessing options is: https://apache.github.io/systemds/site/dml-language-reference.html
+
+There we provide different examples for defining jspecs and document what functionality is currently supported.
+
+After defining the jspec.json file, we can read it by passing the file path, data_type, and value_type using the following command:
+
+.. code-block:: python
+
+    dataset_jspec = "adult/jspec.json"
+    jspec = sds.read(dataset_jspec, data_type="scalar", value_type="string")
+
+Finally, we need to define a custom dml file with helper functions that split the features from the labels and replace certain values; we will need these functions later.
+We will call this file "preprocess.dml":
+
+.. code-block::
+
+    # extract the feature columns (all but the last column) for rows start..stop
+    get_X = function(matrix[double] X, int start, int stop)
+      return (matrix[double] returnVal) {
+        returnVal = X[start:stop, 1:ncol(X)-1]
+    }
+
+    # extract the label column (the last column) for rows start..stop
+    get_Y = function(matrix[double] X, int start, int stop)
+      return (matrix[double] returnVal) {
+        returnVal = X[start:stop, ncol(X):ncol(X)]
+    }
+
+    # replace all occurrences of pattern in X with replacement
+    replace_value = function(matrix[double] X, double pattern, double replacement)
+      return (matrix[double] returnVal) {
+        returnVal = replace(target=X, pattern=pattern, replacement=replacement)
+    }
+
+The get_X function extracts every column except the last one and can also be used to pick a certain row slice from the dataset.
+The get_Y function extracts only the last column, which in our case holds the labels. The replace_value function replaces one double value with another.
+The preprocess.dml file can be read with the following command:
+
+.. code-block:: python
+
+    preprocess_src_path = "preprocess.dml"
+    PREPROCESS_package = sds.source(preprocess_src_path, "preprocess", print_imported_methods=True)
+
+The print_imported_methods flag can be used to verify whether every method has been parsed correctly.
+
+Step 3: Applying the preprocessing steps
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Generally speaking, we would call transform_encode on the train dataset and then, with the returned encoding, call transform_apply on the test dataset (see the sketch further below).
+In the case of the Adult dataset, however, the label names in the test dataset differ from those in the train dataset, which is why we show how to deal with that using SystemDS.
+First of all, we combine the train and the test dataset using the rbind() function, which simply appends the Frame F2 to the end of Frame F1.
+This is necessary to ensure that the encoding is identical for the train and test data.
+
+.. code-block:: python
+
+    X1 = F1.rbind(F2)
+
+In order to apply our jspec file we use the transform_encode() function, passing the .json specification read before.
+In our particular case we obtain the Matrix X1 and the Frame M1 from the operation: X1 holds all the encoded values, and M1 holds the mapping between the encoded values
+and the original values. Columns that are not referenced in the .json file are not altered.
+
+.. code-block:: python
+
+    X1, M1 = X1.transform_encode(spec=jspec)
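+
+For reference, had the train and test labels been consistent, the more common pattern would be to encode only the training frame and then re-apply the resulting metadata to the test frame. A sketch of that approach (not used here because of the label mismatch):
+
+.. code-block:: python
+
+    # fit the encoding (recode, bin, dummycode, impute) on the training frame only ...
+    X_train, M = F1.transform_encode(spec=jspec)
+    # ... and re-apply exactly the same encoding to the test frame via the metadata M
+    X_test = F2.transform_apply(spec=jspec, meta=M)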
+
+We can now use the previously sourced dml file to split the dataset and unify the inconsistent labels. Note that the
+file is parsed such that its functions can be called directly from the Python API.
+Because the test labels carry a trailing period ("<=50K." and ">50K." instead of "<=50K" and ">50K"), the recode step assigns them the separate codes 3 and 4; the two replace_value calls below map these back to the training codes 1 and 2.
+
+.. code-block:: python
+
+    X = PREPROCESS_package.get_X(X1, 1, train_count)
+    Y = PREPROCESS_package.get_Y(X1, 1, train_count)
+
+    Xt = PREPROCESS_package.get_X(X1, train_count, train_count+test_count)
+    Yt = PREPROCESS_package.get_Y(X1, train_count, train_count+test_count)
+
+    Yt = PREPROCESS_package.replace_value(Yt, 3.0, 1.0)
+    Yt = PREPROCESS_package.replace_value(Yt, 4.0, 2.0)
+
+Step 4: Training and confusion matrix
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Now that we have prepared the data, we can use the multiLogReg function.
+These steps are identical to steps 2 and 3 described in level 1 of this tutorial.
+
+.. code-block:: python
+
+    from systemds.operator.algorithm import multiLogReg
+    from systemds.operator.algorithm import confusionMatrix
+    from systemds.operator.algorithm import multiLogRegPredict
+    betas = multiLogReg(X, Y)
+    [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt)
+    confusion_matrix_abs, _ = confusionMatrix(y_pred, Yt).compute()
+    print(confusion_matrix_abs)
+
+Full Script
+~~~~~~~~~~~
+
+The complete script can be seen here:
+
+.. code-block:: python
+
+    import numpy as np
+    from systemds.context import SystemDSContext
+    from systemds.operator.algorithm import multiLogReg, multiLogRegPredict, confusionMatrix
+
+    train_count = 32561
+    test_count = 16281
+
+    dataset_path_train = "adult/train_data.csv"
+    dataset_path_test = "adult/test_data.csv"
+    dataset_jspec = "adult/jspec.json"
+    preprocess_src_path = "preprocess.dml"
+
+    sds = SystemDSContext()
+
+    SCHEMA = '"DOUBLE,STRING,DOUBLE,STRING,DOUBLE,STRING,STRING,STRING,STRING,STRING,DOUBLE,DOUBLE,DOUBLE,STRING,STRING"'
+
+    F1 = sds.read(dataset_path_train, schema=SCHEMA)
+    F2 = sds.read(dataset_path_test,  schema=SCHEMA)
+
+    jspec = sds.read(dataset_jspec, data_type="scalar", value_type="string")
+    PREPROCESS_package = sds.source(preprocess_src_path, "preprocess", print_imported_methods=True)
+
+    X1 = F1.rbind(F2)
+    X1, M1 = X1.transform_encode(spec=jspec)
+
+    X = PREPROCESS_package.get_X(X1, 1, train_count)
+    Y = PREPROCESS_package.get_Y(X1, 1, train_count)
+
+    Xt = PREPROCESS_package.get_X(X1, train_count, train_count+test_count)
+    Yt = PREPROCESS_package.get_Y(X1, train_count, train_count+test_count)
+
+    Yt = PREPROCESS_package.replace_value(Yt, 3.0, 1.0)
+    Yt = PREPROCESS_package.replace_value(Yt, 4.0, 2.0)
+
+    betas = multiLogReg(X, Y)
+
+    [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt)
+
+    confusion_matrix_abs, _ = confusionMatrix(y_pred, Yt).compute()
+    print(confusion_matrix_abs)
+
+Level 3
+-------
+
+In this level we show how to integrate a custom-built algorithm using the Python API.
+For this we introduce another dml file, which can be used to train a basic feed-forward network.
+
+Step 1: Obtain data
+~~~~~~~~~~~~~~~~~~~
+
+For the whole data setup, please refer to level 2, steps 1 to 3, as they are identical.
+
+.. code-block:: python
+
+    import numpy as np
+    from systemds.context import SystemDSContext
+
+    train_count = 32561
+    test_count = 16281
+
+    dataset_path_train = "adult/train_data.csv"
+    dataset_path_test = "adult/test_data.csv"
+    dataset_jspec = "adult/jspec.json"
+    preprocess_src_path = "preprocess.dml"
+
+    sds = SystemDSContext()
+
+    SCHEMA = '"DOUBLE,STRING,DOUBLE,STRING,DOUBLE,STRING,STRING,STRING,STRING,STRING,DOUBLE,DOUBLE,DOUBLE,STRING,STRING"'
+
+    F1 = sds.read(dataset_path_train, schema=SCHEMA)
+    F2 = sds.read(dataset_path_test,  schema=SCHEMA)
+
+    jspec = sds.read(dataset_jspec, data_type="scalar", value_type="string")
+    PREPROCESS_package = sds.source(preprocess_src_path, "preprocess", print_imported_methods=True)
+
+    X1 = F1.rbind(F2)
+    X1, M1 = X1.transform_encode(spec=jspec)
+
+    X = PREPROCESS_package.get_X(X1, 1, train_count)
+    Y = PREPROCESS_package.get_Y(X1, 1, train_count)
+
+    Xt = PREPROCESS_package.get_X(X1, train_count, train_count+test_count)
+    Yt = PREPROCESS_package.get_Y(X1, train_count, train_count+test_count)
+
+    Yt = PREPROCESS_package.replace_value(Yt, 3.0, 1.0)
+    Yt = PREPROCESS_package.replace_value(Yt, 4.0, 2.0)
+
+Step 2: Load the algorithm
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We use a neural network with 2 hidden layers, each consisting of 200 neurons.
+First, we need to source the dml file for neural networks.
+This file includes all the necessary functions for training, evaluating, and storing the model.
+The object returned by the source call is then used to call these functions.
+The file can be found here:
+
+    - :doc:`tests/examples/tutorials/neural_net_source.dml`
+
+.. code-block:: python
+
+    # hypothetical path: adjust it to wherever you stored neural_net_source.dml
+    neural_net_src_path = "neural_net_source.dml"
+    FFN_package = sds.source(neural_net_src_path, "fnn", print_imported_methods=True)

Review comment:
       I would not print the included methods in the source call.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] Baunsgaard commented on a change in pull request #1323: [WIP][SYSTEMDS-2835] Python end-to-end tutorial

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on a change in pull request #1323:
URL: https://github.com/apache/systemds/pull/1323#discussion_r663369719



##########
File path: src/main/python/tests/examples/tutorials/test_adult.py
##########
@@ -0,0 +1,243 @@
+# -------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+# -------------------------------------------------------------
+import os
+import unittest
+
+import numpy as np
+from systemds.context import SystemDSContext
+from systemds.examples.tutorials.adult import DataManager
+from systemds.operator import OperationNode
+from systemds.operator.algorithm import kmeans, multiLogReg, multiLogRegPredict, l2svm, confusionMatrix, scale, scaleApply, split, winsorize
+from systemds.script_building import DMLScript
+
+
+class Test_DMLScript(unittest.TestCase):
+    """
+    Test class for adult dml script tutorial code.
+    """
+
+    sds: SystemDSContext = None
+    d: DataManager = None
+    neural_net_src_path: str = "tests/examples/tutorials/neural_net_source.dml"
+    dataset_path_train: str = "../../test/resources/datasets/adult/train_data.csv"
+    dataset_path_train_mtd: str = "../../test/resources/datasets/adult/train_data.csv.mtd"
+    dataset_path_test: str = "../../test/resources/datasets/adult/test_data.csv"
+    dataset_path_test_mtd: str = "../../test/resources/datasets/adult/test_data.csv.mtd"
+    dataset_jspec: str = "../../test/resources/datasets/adult/jspec.json"
+
+    @classmethod
+    def setUpClass(cls):
+        cls.sds = SystemDSContext()
+        cls.d = DataManager()
+
+    @classmethod
+    def tearDownClass(cls):
+        cls.sds.close()
+
+    def test_train_data(self):
+        x = self.d.get_train_data()
+        self.assertEqual((32561, 14), x.shape)
+
+    def test_train_labels(self):
+        y = self.d.get_train_labels()
+        self.assertEqual((32561,), y.shape)
+
+    def test_test_data(self):
+        x_l = self.d.get_test_data()
+        self.assertEqual((16281, 14), x_l.shape)
+
+    def test_test_labels(self):
+        y_l = self.d.get_test_labels()
+        self.assertEqual((16281,), y_l.shape)
+
+    def test_preprocess(self):
+        #assumes certain preprocessing
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset()
+        self.assertEqual((30162,104), train_data.shape)
+        self.assertEqual((30162, ), train_labels.shape)
+        self.assertEqual((15060,104), test_data.shape)
+        self.assertEqual((15060, ), test_labels.shape)
+
+    def test_multi_log_reg(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset()
+
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+        Y = Y + 1.0
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+        Yt = Yt + 1.0
+
+        betas = multiLogReg(X, Y)
+
+        [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt).compute()
+
+        self.assertGreater(acc, 80)
+
+        confusion_matrix_abs, _ = confusionMatrix(self.sds.from_numpy(y_pred), Yt).compute()
+
+        self.assertTrue(
+            np.allclose(
+                confusion_matrix_abs,
+                np.array([[3503, 503],
+                          [268, 726]])
+            )
+        )
+
+
+    def test_multi_log_reg_interpolated_standardized(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+        Y = Y + 1.0
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+        Yt = Yt + 1.0
+
+        betas = multiLogReg(X, Y)
+
+        [_, y_pred, acc] = multiLogRegPredict(Xt, betas, Yt).compute()
+
+        self.assertGreater(acc, 80)
+        
+        confusion_matrix_abs, _ = confusionMatrix(self.sds.from_numpy(y_pred), Yt).compute()
+
+        self.assertTrue(
+            np.allclose(
+                confusion_matrix_abs,
+                np.array([[3583,  502],
+                         [245,  670]])
+            )
+        )
+
+
+    def test_neural_net(self):
+        # Reduced because we want the tests to finish a bit faster.
+        train_count = 15000
+        test_count = 5000
+
+        train_data, train_labels, test_data, test_labels = self.d.get_preprocessed_dataset(interpolate=True, standardize=True, dimred=0.1)
+
+        # Train data
+        X = self.sds.from_numpy( train_data[:train_count])
+        Y = self.sds.from_numpy( train_labels[:train_count])
+
+        # Test data
+        Xt = self.sds.from_numpy(test_data[:test_count])
+        Yt = self.sds.from_numpy(test_labels[:test_count])
+
+        FFN_package = self.sds.source(self.neural_net_src_path, "fnn", print_imported_methods=True)
+
+        network = FFN_package.train(X, Y, 1, 16, 0.01, 1)
+
+        self.assertTrue(type(network) is not None) # sourcing and training seems to work
+
+        FFN_package.save_model(network, '"model/python_FFN/"').compute(verbose=True)
+
+        # TODO This does not work yet, not sure what the problem is
+        #probs = FFN_package.predict(Xt, network).compute(True)
+        # FFN_package.eval(Yt, Yt).compute()

Review comment:
       Not yet.
   I did not run it, will do next week.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] Gandagorn commented on a change in pull request #1323: [WIP][SYSTEMDS-2835] Python end-to-end tutorial

Posted by GitBox <gi...@apache.org>.
Gandagorn commented on a change in pull request #1323:
URL: https://github.com/apache/systemds/pull/1323#discussion_r663786508



##########
File path: src/main/python/tests/examples/tutorials/test_adult.py
##########
@@ -387,6 +387,11 @@ def test_level2(self):
         ################################################################################################################
         X1, M1 = X1.transform_encode(spec=jspec)
 
+        # better alternative for encoding
+        # X1, M = F1.transform_encode(spec=jspec)
+        # X2 = F2.transform_apply(spec=jspec, meta=M)
+        # testX2 = X2.compute(True)

Review comment:
       @Baunsgaard
   Hi, we tried to implement a better version of the encoding using transform_apply, because otherwise we would calculate statistics for imputation on the whole data instead of just the training data.
   Unfortunately we ran into the problem that the labels in the train data are slightly different from the labels in the test data ("<=50K" != "<=50K."), which prevents us from using the encoding M for the test data. We tried different methods for correcting the labels in the test data before encoding; however, the main problem is that we have not found a good way to apply changes to a column of a frame, and we are also unable to use frames as arguments for a dml script function (this does not seem to be supported yet with Python?).
   The simplest solution would be to remove the "." at the end of the test labels in the file itself. Other possible workarounds using just systemds functions seem to be quite complex and would probably miss the main goal of this tutorial.
   How should we proceed?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] codeyeeter commented on pull request #1323: [WIP][SYSTEMDS-2835] Python end-to-end tutorial

Posted by GitBox <gi...@apache.org>.
codeyeeter commented on pull request #1323:
URL: https://github.com/apache/systemds/pull/1323#issuecomment-889152955


   @Baunsgaard 
   We went through the comments and tried to answer all of your questions. Based on your initial comment, can we assume that you will do the code and formatting changes while merging? Or is there still anything expected from us?
   Best regards,
   Jörg Rainer


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] Baunsgaard commented on pull request #1323: [WIP][SYSTEMDS-2835] Python end-to-end tutorial

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on pull request #1323:
URL: https://github.com/apache/systemds/pull/1323#issuecomment-889779745


   > @Baunsgaard
   > We went through the comments and tried to answer all of your questions, based on your initial comment can we assume that you will do the code and formatting changes while merging? Or is there still anything expected from us?
   > Best regards,
   > Jörg Rainer
   
   I will do the rest. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] Gandagorn commented on a change in pull request #1323: [WIP][SYSTEMDS-2835] Python end-to-end tutorial

Posted by GitBox <gi...@apache.org>.
Gandagorn commented on a change in pull request #1323:
URL: https://github.com/apache/systemds/pull/1323#discussion_r668238169



##########
File path: src/main/python/tests/examples/tutorials/test_adult.py
##########
@@ -387,6 +387,11 @@ def test_level2(self):
         ################################################################################################################
         X1, M1 = X1.transform_encode(spec=jspec)
 
+        # better alternative for encoding
+        # X1, M = F1.transform_encode(spec=jspec)
+        # X2 = F2.transform_apply(spec=jspec, meta=M)
+        # testX2 = X2.compute(True)

Review comment:
       Thank you!
   One problem still remains: The replace works for ">50K.", however it does not for "<=50K.". I played around a bit, and my guess is that the "<=" is interpreted as some kind of operator, because when I execute
   `F2 = F2.replace("\<=50K.", "<=50K")`
   I get this error
   
   > Exception       : An error occurred while calling o0.prepareScript.
   : org.apache.sysds.parser.ParseException: 
   The following 4 parse issues were encountered:
   #1 [line 2:32] [Syntax error] -> V1=replace(target=V0,pattern='\<=50K.',replacement='<=50K');
      no viable alternative at input 'V1=replace(target=V0,pattern==' ([@25,226:226='=',<7>,2:32])
   #2 [line 2:32] [Syntax error] -> V1=replace(target=V0,pattern='\<=50K.',replacement='<=50K');
      extraneous input '=' expecting {'(', '[', '-', '+', '!', 'TRUE', 'FALSE', ID, INT, DOUBLE, COMMANDLINE_NAMED_ID, COMMANDLINE_POSITION_ID, STRING} ([@25,226:226='=',<7>,2:32])
   #3 [line 2:32] [Validation error] -> V1=replace(target=V0,pattern='\<=50K.',replacement='<=50K');
      cannot parse the int value: '50K'
   #4 [line 2:35] [Syntax error] -> V1=replace(target=V0,pattern='\<=50K.',replacement='<=50K');
      extraneous input 'K' expecting {')', ',', '^', '-', '+', '%*%', '%/%', '%%', '*', '/', '>', '>=', '<', '<=', '==', '!=', '&', '&&', '|', '||'} ([@27,229:229='K',<60>,2:35])




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org