Posted to dev@systemds.apache.org by GitBox <gi...@apache.org> on 2021/08/01 14:39:39 UTC

[GitHub] [systemds] mboehm7 commented on a change in pull request #1357: [WIP] Clean-up of the performance test suites for binomial, multinomial and regression benchmarks

mboehm7 commented on a change in pull request #1357:
URL: https://github.com/apache/systemds/pull/1357#discussion_r680517409



##########
File path: scripts/perftest/CHANGES.md
##########
@@ -0,0 +1,48 @@
+# New additions to the performance test suite
+Most of the new files were copied from the deprecated performance test suite (scripts/perftestDeprecated) and refactored to call SystemDS with additional configuration.
+Most of the new DML scripts were copied from scripts/algorithms to scripts/perftest/scripts and then adapted to use built-in functions, if available.
+
+### General changes of perftest and the refactored files moved from perftestDeprecated to perftest
+- Added line for intel oneapi MKL system variable initialization in the matrixmult script. The initialization is commented for now, as it would be executed by the runAll.sh.
+- Added commented initialization for MKL system variables in the runAll.sh. 
+- By default, shell scripts can now be invoked without any additional parameters, but optional arguments can be given for output folder and the command to be ran (MR, SPARK, ECHO).
+- Added SystemDS-config.xml in the perftest/conf folder, which is used by all refactored perftest scripts.
+- times.txt was moved to the "results" folder in perftest.
+- Time measurements appended to results/times.txt are now additionally measured in microseconds instead of just seconds, for the smaller data benchmarks.
+- All DML scripts, that are ultimately called by the microbenchmarks, can be found in perftest/scripts. This excludes the original algorithmic scripts that are still in use, if there was no corresponding built-in function.
+- Removed the -explain flag from all systemds calls.
+
+### Bash scripts that now call a new DML script that makes use of a built-in function, instead of a fully implemented algorithm
+- perftest/runMultiLogReg.sh -> perftest/scripts/MultiLogReg.dml
+- perftest/runL2SVM.sh -> perftest/scripts/l2-svm-predict.dml
+- perftest/runMSVM.sh -> perftest/scripts/m-svm.dml
+- perftest/runMSVM.sh -> perftest/scripts/m-svm-predict.dml
+- perftest/runNaiveBayes.sh -> perftest/scripts/naive-bayes.dml
+- perftest/runNaiveBayes.sh -> perftest/scripts/naive-bayes-predict.dml
+- perftest/runLinearRegCG.sh -> perftest/scripts/LinearRegCG.dml
+- perftest/runLinearRegDS.sh -> perftest/scripts/LinearRegDS.dml
+- perftest/runGLM_poisson_log.sh -> perftest/scripts/GLM.dml
+- perftest/runGLM_gamma_log.sh -> perftest/scripts/GLM.dml
+- perftest/runGLM_binomial_probit.sh -> perftest/scripts/GLM.dml
+
+
+### Bash scripts still calling old DML scripts, which fully implement algorithms
+- perftest/runMultiLogReg.sh -> algorithms/GLM-predict.dml
+- perftest/runLinearRegCG.sh -> algorithms/GLM-predict.dml
+- perftest/runLinearRegDS.sh -> algorithms/GLM-predict.dml
+- perftest/runGLM_poisson_log.sh -> algorithms/GLM-predict.dml
+- perftest/runGLM_gamma_log.sh -> algorithms/GLM-predict.dml
+- perftest/runGLM_binomial_probit.sh -> algorithms/GLM-predict.dml
+
+### Bash scripts that already did call a DML script with a single built-in functions (only needed some refactoring)
+- perftest/runL2SVM.sh -> algorithms/l2-svm.dml (This already uses the built-in function l2svm!)
+
+
+
+	
+	
+	
+	
+	
+	
+

Review comment:
       Remove these 10 empty lines.

##########
File path: scripts/perftest/runAll.sh
##########
@@ -20,11 +20,67 @@
 #
 #-------------------------------------------------------------
 
+#if [ "$1" == "" -o "$2" == "" ]; then echo "Usage: $0 <hdfsDataDir> <MR | SPARK | ECHO>   e.g. $0 perftest SPARK" ; exit 1 ; fi
 
-# Micro Benchmarks:
+# Example usage:
+# ./runAll.sh temp MR
 
-./scripts/perftest/MatrixMult.sh
-./scripts/perftest/MatrixTranspose.sh
+# First argument is optional, but can be e.g. perftestTemp
+TEMPFOLDER=$1
+if [ "$TEMPFOLDER" == "" ]; then TEMPFOLDER=temp ; fi
 
-# Algorithms Benchmarks:
+# Second argument is optional, but can be MR | SPARK | ECHO
+COMMAND=$2
+if [ "$COMMAND" == "" ]; then COMMAND=MR ; fi
+
+# Set properties
+export LOG4JPROP='conf/log4j-off.properties'
+export SYSDS_QUIET=1
+#export SYSTEMDS_ROOT=$(pwd)

Review comment:
       Remove commented code that is no longer used.

##########
File path: scripts/perftest/runAll.sh
##########
@@ -20,11 +20,67 @@
 #
 #-------------------------------------------------------------
 
+#if [ "$1" == "" -o "$2" == "" ]; then echo "Usage: $0 <hdfsDataDir> <MR | SPARK | ECHO>   e.g. $0 perftest SPARK" ; exit 1 ; fi
 
-# Micro Benchmarks:
+# Example usage:
+# ./runAll.sh temp MR
 
-./scripts/perftest/MatrixMult.sh
-./scripts/perftest/MatrixTranspose.sh
+# First argument is optional, but can be e.g. perftestTemp
+TEMPFOLDER=$1
+if [ "$TEMPFOLDER" == "" ]; then TEMPFOLDER=temp ; fi
 
-# Algorithms Benchmarks:
+# Second argument is optional, but can be MR | SPARK | ECHO
+COMMAND=$2
+if [ "$COMMAND" == "" ]; then COMMAND=MR ; fi
+
+# Set properties
+export LOG4JPROP='conf/log4j-off.properties'
+export SYSDS_QUIET=1
+#export SYSTEMDS_ROOT=$(pwd)
+#export PATH=$SYSTEMDS_ROOT/bin:$PATH
+
+
+# Initialize Intel MKL

Review comment:
       After thinking more about it, I think it would be best to allow people to configure the use of BLAS from outside, without trying to hard-code it here.
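
       One possible shape for this, as a sketch only: runAll.sh leaves BLAS alone unless the caller opts in through the environment. `PERFTEST_BLAS_INIT` and `init_blas` are hypothetical names introduced here for illustration; they are not existing SystemDS settings or code from this PR.

```bash
#!/bin/bash
# Sketch: let the caller decide whether and how a native BLAS is initialized.
# PERFTEST_BLAS_INIT is a hypothetical variable (not part of SystemDS). It would
# hold the path of an environment script that the caller wants sourced, for
# example the setup script shipped with their BLAS installation.

init_blas() {
  local init_script="${PERFTEST_BLAS_INIT:-}"
  if [ -z "$init_script" ]; then
    # Nothing requested: do not touch the environment at all.
    echo "BLAS: no external init requested, using defaults"
    return 0
  fi
  if [ ! -r "$init_script" ]; then
    echo "BLAS: init script '$init_script' not readable" >&2
    return 1
  fi
  # Source the caller-provided script so its exports reach the benchmarks.
  . "$init_script"
  echo "BLAS: sourced $init_script"
}

init_blas
```

       With something like this, the script itself contains no vendor-specific path, and a developer who wants a particular BLAS sets the variable before invoking runAll.sh.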

##########
File path: scripts/perftest/CHANGES.md
##########
@@ -0,0 +1,48 @@
+# New additions to the performance test suite

Review comment:
       Please add the license header (see README).

##########
File path: scripts/perftest/README.md
##########
@@ -28,10 +28,16 @@ There are a few prerequisites:
 - Setup OpenBlas: <https://github.com/xianyi/OpenBLAS/wiki/Precompiled-installation-packages>
 - Install Perf stat: <https://linoxide.com/linux-how-to/install-perf-tool-centos-ubuntu/>
 
-NOTE THE SCRIPT HAS TO BE RUN FROM THE ROOT OF THE REPOSITORY.
+NOTE THE SCRIPT HAS TO BE RUN FROM THE PERFTEST FOLDER.
 
+Examples:
 ```bash
-./scripts/perftest/runAll.sh
+./runAll.sh
+
+./runAll.sh perftestTemp MR

Review comment:
       I don't think we need the MR vs CP differentiation here; simply let a developer specify the CMD that is used consistently.
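
       A minimal sketch of that idea, assuming the second argument is taken verbatim as the command to run rather than being mapped from MR | SPARK | ECHO. The helper name `resolve_cmd` and the default `systemds` are illustrative assumptions, not code from this PR.

```bash
#!/bin/bash
# Sketch: take the command verbatim from the caller instead of translating
# MR | SPARK | ECHO into a hard-coded launcher.
# resolve_cmd is a hypothetical helper; "systemds" as the default is an assumption.

resolve_cmd() {
  # $1: command given by the developer (may be empty)
  echo "${1:-systemds}"
}

TEMPFOLDER="${1:-temp}"
CMD="$(resolve_cmd "${2:-}")"

# Every sub-script would receive the same CMD, so one run uses it consistently.
# The calls are printed rather than executed, since the sub-scripts are not
# part of this sketch.
for script in genBinomialData.sh genMultinomialData.sh; do
  echo "./$script $TEMPFOLDER $CMD"
done
```

       Under these assumptions, `./runAll.sh perftestTemp ./sparkDML.sh` would run everything through the Spark launcher, and `./runAll.sh perftestTemp echo` would give a dry run.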

##########
File path: scripts/perftest/scripts/LinearRegCG.dml
##########
@@ -0,0 +1,66 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+# THIS SCRIPT SOLVES LINEAR REGRESSION USING THE CONJUGATE GRADIENT ALGORITHM

Review comment:
       There is no need to duplicate these comments here. Maybe reduce the dml script to the read, core invocation, and selected comments for used or passed arguments.

##########
File path: scripts/perftest/genBinomialData.sh
##########
@@ -0,0 +1,68 @@
+#!/bin/bash
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+# 
+#   http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+if [ "$1" == "" -o "$2" == "" ]; then echo "Usage: $0 <hdfsDataDir> <MR | SPARK | ECHO>   e.g. $0 perftest SPARK" ; exit 1 ; fi
+if [ "$2" == "SPARK" ]; then CMD="./sparkDML.sh "; DASH="-"; elif [ "$2" == "MR" ]; then CMD="systemds " ; else CMD="echo " ; fi
+
+BASE=$1/binomial
+
+FORMAT="binary" # can be csv, mm, text, binary
+DENSE_SP=0.9
+SPARSE_SP=0.01
+
+
+#generate XS scenarios (80MB)
+${CMD} -f ../datagen/genRandData4LogisticRegression.dml --args 10000 1000 5 5 ${BASE}/w10k_1k_dense ${BASE}/X10k_1k_dense ${BASE}/y10k_1k_dense 1 0 $DENSE_SP $FORMAT 1
+${CMD} -f ../datagen/genRandData4LogisticRegression.dml --args 10000 1000 5 5 ${BASE}/w10k_1k_sparse ${BASE}/X10k_1k_sparse ${BASE}/y10k_1k_sparse 1 0 $SPARSE_SP $FORMAT 1
+${CMD} -f scripts/extractTestData.dml --args ${BASE}/X10k_1k_dense ${BASE}/y10k_1k_dense ${BASE}/X10k_1k_dense_test ${BASE}/y10k_1k_dense_test $FORMAT
+${CMD} -f scripts/extractTestData.dml --args ${BASE}/X10k_1k_sparse ${BASE}/y10k_1k_sparse ${BASE}/X10k_1k_sparse_test ${BASE}/y10k_1k_sparse_test $FORMAT
+
+##generate S scenarios (800MB)
+#${CMD} -f ../datagen/genRandData4LogisticRegression.dml --args 100000 1000 5 5 ${BASE}/w100k_1k_dense ${BASE}/X100k_1k_dense ${BASE}/y100k_1k_dense 1 0 $DENSE_SP $FORMAT 1
+#${CMD} -f ../datagen/genRandData4LogisticRegression.dml --args 100000 1000 5 5 ${BASE}/w100k_1k_sparse ${BASE}/X100k_1k_sparse ${BASE}/y100k_1k_sparse 1 0 $SPARSE_SP $FORMAT 1
+#${CMD} -f scripts/extractTestData.dml --args ${BASE}/X100k_1k_dense ${BASE}/y100k_1k_dense ${BASE}/X100k_1k_dense_test ${BASE}/y100k_1k_dense_test $FORMAT
+#${CMD} -f scripts/extractTestData.dml --args ${BASE}/X100k_1k_sparse ${BASE}/y100k_1k_sparse ${BASE}/X100k_1k_sparse_test ${BASE}/y100k_1k_sparse_test $FORMAT
+#
+##generate M scenarios (8GB)
+#${CMD} -f ../datagen/genRandData4LogisticRegression.dml --args 1000000 1000 5 5 ${BASE}/w1M_1k_dense ${BASE}/X1M_1k_dense ${BASE}/y1M_1k_dense 1 0 $DENSE_SP $FORMAT 1
+#${CMD} -f ../datagen/genRandData4LogisticRegression.dml --args 1000000 1000 5 5 ${BASE}/w1M_1k_sparse ${BASE}/X1M_1k_sparse ${BASE}/y1M_1k_sparse 1 0 $SPARSE_SP $FORMAT 1
+#${CMD} -f scripts/extractTestData.dml --args ${BASE}/X1M_1k_dense ${BASE}/y1M_1k_dense ${BASE}/X1M_1k_dense_test ${BASE}/y1M_1k_dense_test $FORMAT
+#${CMD} -f scripts/extractTestData.dml --args ${BASE}/X1M_1k_sparse ${BASE}/y1M_1k_sparse ${BASE}/X1M_1k_sparse_test ${BASE}/y1M_1k_sparse_test $FORMAT
+#
+##generate L scenarios (80GB)
+#${CMD} -f ../datagen/genRandData4LogisticRegression.dml --args 10000000 1000 5 5 ${BASE}/w10M_1k_dense ${BASE}/X10M_1k_dense ${BASE}/y10M_1k_dense 1 0 $DENSE_SP $FORMAT 1
+#${CMD} -f ../datagen/genRandData4LogisticRegression.dml --args 10000000 1000 5 5 ${BASE}/w10M_1k_sparse ${BASE}/X10M_1k_sparse ${BASE}/y10M_1k_sparse 1 0 $SPARSE_SP $FORMAT 1
+#${CMD} -f scripts/extractTestData.dml --args ${BASE}/X10M_1k_dense ${BASE}/y10M_1k_dense ${BASE}/X10M_1k_dense_test ${BASE}/y10M_1k_dense_test $FORMAT
+#${CMD} -f scripts/extractTestData.dml --args ${BASE}/X10M_1k_sparse ${BASE}/y10M_1k_sparse ${BASE}/X10M_1k_sparse_test ${BASE}/y10M_1k_sparse_test $FORMAT
+#
+##generate XL scenarios (800GB)
+#${CMD} -f ../datagen/genRandData4LogisticRegression.dml --args 100000000 1000 5 5 ${BASE}/w100M_1k_dense ${BASE}/X100M_1k_dense ${BASE}/y100M_1k_dense 1 0 $DENSE_SP $FORMAT 1
+#${CMD} -f ../datagen/genRandData4LogisticRegression.dml --args 100000000 1000 5 5 ${BASE}/w100M_1k_sparse ${BASE}/X100M_1k_sparse ${BASE}/y100M_1k_sparse 1 0 $SPARSE_SP $FORMAT 1
+#${CMD} -f scripts/extractTestData.dml --args ${BASE}/X100M_1k_dense ${BASE}/y100M_1k_dense ${BASE}/X100M_1k_dense_test ${BASE}/y100M_1k_dense_test $FORMAT
+#${CMD} -f scripts/extractTestData.dml --args ${BASE}/X100M_1k_sparse ${BASE}/y100M_1k_sparse ${BASE}/X100M_1k_sparse_test ${BASE}/y100M_1k_sparse_test $FORMAT
+#
+###generate KDD scenario (csv would be infeasible)

Review comment:
       Please remove the KDD scenario, which was originally a test for ultra-sparse (sparsity=1e-7) distributed data sets. This test suite should mostly handle synthetic datasets.

##########
File path: scripts/perftest/genMultinomialData.sh
##########
@@ -0,0 +1,60 @@
+#!/bin/bash
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+# 
+#   http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+if [ "$1" == "" -o "$2" == "" ]; then echo "Usage: $0 <hdfsDataDir> <MR | SPARK | ECHO>   e.g. $0 perftest SPARK" ; exit 1 ; fi
+if [ "$2" == "SPARK" ]; then CMD="./sparkDML.sh "; DASH="-"; elif [ "$2" == "MR" ]; then CMD="systemds " ; else CMD="echo " ; fi

Review comment:
       See above: please let people specify CMD themselves, maybe in runAll.sh.

##########
File path: scripts/perftest/scripts/m-svm.dml
##########
@@ -0,0 +1,67 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+# Call of built-in multiclass SVM. Check documentation for more information:
+# https://apache.github.io/systemds/site/builtins-reference#msvm-function
+# 
+# Example Usage:
+# Assume SVM_HOME is set to the home of the dml script
+# Assume input and output directories are on hdfs as INPUT_DIR and OUTPUT_DIR
+# Assume epsilon = 0.001, lambda=1.0, maxiterations = 100
+#
+# INPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME      TYPE    DEFAULT     MEANING
+# ---------------------------------------------------------------------------------------------
+# X         String  ---         Location to read the matrix X of feature vectors
+# Y         String  ---         Location to read response matrix Y
+# icpt      Int     0           Intercept presence
+#                               0 = no intercept
+#                               1 = add intercept;
+# tol       Double  0.001       Tolerance (epsilon);
+# reg       Double  1.0         Regularization parameter (lambda)
+# maxiter   Int     100         Maximum number of conjugate gradient iterations
+# model     String  ---         Location to write model
+# fmt       String  "text"      The output format of the output, such as "text" or "csv"
+# ---------------------------------------------------------------------------------------------
+#
+# Script invocation example:

Review comment:
       Please remove these invocation examples.

##########
File path: scripts/perftest/runAll.sh
##########
@@ -20,11 +20,67 @@
 #
 #-------------------------------------------------------------
 
+#if [ "$1" == "" -o "$2" == "" ]; then echo "Usage: $0 <hdfsDataDir> <MR | SPARK | ECHO>   e.g. $0 perftest SPARK" ; exit 1 ; fi
 
-# Micro Benchmarks:
+# Example usage:
+# ./runAll.sh temp MR
 
-./scripts/perftest/MatrixMult.sh
-./scripts/perftest/MatrixTranspose.sh
+# First argument is optional, but can be e.g. perftestTemp
+TEMPFOLDER=$1
+if [ "$TEMPFOLDER" == "" ]; then TEMPFOLDER=temp ; fi
 
-# Algorithms Benchmarks:
+# Second argument is optional, but can be MR | SPARK | ECHO
+COMMAND=$2
+if [ "$COMMAND" == "" ]; then COMMAND=MR ; fi
+
+# Set properties
+export LOG4JPROP='conf/log4j-off.properties'
+export SYSDS_QUIET=1
+#export SYSTEMDS_ROOT=$(pwd)
+#export PATH=$SYSTEMDS_ROOT/bin:$PATH
+
+
+# Initialize Intel MKL

Review comment:
       Maybe just take the three lines and add them as a comment to runAll.

##########
File path: scripts/perftest/scripts/extractTestData.dml
##########
@@ -0,0 +1,31 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+# 
+#   http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+n = 5000;
+
+X = read($1);
+y = read($2);
+
+X = X[1:n,];
+y = y[1:n,];

Review comment:
       Please replace this with the split builtin function, using a fixed fraction instead of a constant 5000 tuples. This would also test more scenarios, including batch prediction.



