Posted to dev@systemds.apache.org by GitBox <gi...@apache.org> on 2020/09/28 16:01:27 UTC

[GitHub] [systemds] SvenCelin opened a new pull request #1071: Lasso and PPCA

SvenCelin opened a new pull request #1071:
URL: https://github.com/apache/systemds/pull/1071


   Lasso and PPCA functions transferred to Builtins


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] SvenCelin commented on a change in pull request #1071: Lasso and PPCA

Posted by GitBox <gi...@apache.org>.
SvenCelin commented on a change in pull request #1071:
URL: https://github.com/apache/systemds/pull/1071#discussion_r497409877



##########
File path: scripts/builtin/ppca.dml
##########
@@ -0,0 +1,154 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+# This script performs Probabilistic Principal Component Analysis (PCA) on the given input data.
+# It is based on paper: sPCA: Scalable Principal Component Analysis for Big Data on Distributed
+# Platforms. Tarek Elgamal et.al.
+
+# INPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME   	TYPE   DEFAULT  MEANING
+# ---------------------------------------------------------------------------------------------
+# X  	 	String ---      location to read the matrix X input matrix
+# k      	Int    ---      indicates dimension of the new vector space constructed from eigen vectors
+# tolobj 	Int    0.00001  objective function tolerance value to stop ppca algorithm
+# tolrecerr	Int    0.02     reconstruction error tolerance value to stop the algorithm
+# iter   	Int    10       maximum number of iterations
+# fmt    	String 'text'   output format of results PPCA such as "text" or "csv"

Review comment:
       Done. 

##########
File path: src/main/java/org/apache/sysds/common/Builtins.java
##########
@@ -152,6 +153,7 @@
 	PCA("pca", true),
 	PNMF("pnmf", true),
 	PPRED("ppred", false),
+	PPCA("ppca", true),

Review comment:
       Done.







[GitHub] [systemds] SvenCelin commented on a change in pull request #1071: Lasso and PPCA

Posted by GitBox <gi...@apache.org>.
SvenCelin commented on a change in pull request #1071:
URL: https://github.com/apache/systemds/pull/1071#discussion_r497410355



##########
File path: src/test/scripts/functions/builtin/lasso.dml
##########
@@ -0,0 +1,25 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+X = read($1)
+y = read($2)
+w = lasso(X = X, y = y)
+write(w, $3)

Review comment:
       Done.







[GitHub] [systemds] SvenCelin commented on a change in pull request #1071: Lasso and PPCA

Posted by GitBox <gi...@apache.org>.
SvenCelin commented on a change in pull request #1071:
URL: https://github.com/apache/systemds/pull/1071#discussion_r497415263



##########
File path: src/test/java/org/apache/sysds/test/functions/builtin/BuiltinPPCATest.java
##########
@@ -0,0 +1,49 @@
+package org.apache.sysds.test.functions.builtin;

Review comment:
       Done.







[GitHub] [systemds] SvenCelin commented on a change in pull request #1071: Lasso and PPCA

Posted by GitBox <gi...@apache.org>.
SvenCelin commented on a change in pull request #1071:
URL: https://github.com/apache/systemds/pull/1071#discussion_r497408288



##########
File path: scripts/builtin/lasso.dml
##########
@@ -0,0 +1,116 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+#uses the sparsa algorithm to perform lasso regression
+
+
+m_lasso = function(Matrix[Double] X, Matrix[Double] y) return(Matrix[Double] w)
+    {
+    n = nrow(X)
+    m = ncol(X)
+
+
+    #params

Review comment:
       Done.







[GitHub] [systemds] SvenCelin commented on a change in pull request #1071: Lasso and PPCA

Posted by GitBox <gi...@apache.org>.
SvenCelin commented on a change in pull request #1071:
URL: https://github.com/apache/systemds/pull/1071#discussion_r497408793



##########
File path: scripts/builtin/ppca.dml
##########
@@ -0,0 +1,154 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+# This script performs Probabilistic Principal Component Analysis (PCA) on the given input data.
+# It is based on paper: sPCA: Scalable Principal Component Analysis for Big Data on Distributed
+# Platforms. Tarek Elgamal et.al.
+
+# INPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME   	TYPE   DEFAULT  MEANING
+# ---------------------------------------------------------------------------------------------
+# X  	 	String ---      location to read the matrix X input matrix
+# k      	Int    ---      indicates dimension of the new vector space constructed from eigen vectors
+# tolobj 	Int    0.00001  objective function tolerance value to stop ppca algorithm
+# tolrecerr	Int    0.02     reconstruction error tolerance value to stop the algorithm
+# iter   	Int    10       maximum number of iterations
+# fmt    	String 'text'   output format of results PPCA such as "text" or "csv"
+# hadoop jar SystemDS.jar -f PPCA.dml -nvargs X=/INPUT_DIR/X  C=/OUTPUT_DIR/C V=/OUTPUT_DIR/V k=2 tol=0.2 iter=100

Review comment:
       Done.







[GitHub] [systemds] mboehm7 commented on pull request #1071: Lasso and PPCA

Posted by GitBox <gi...@apache.org>.
mboehm7 commented on pull request #1071:
URL: https://github.com/apache/systemds/pull/1071#issuecomment-706586118


   LGTM - thanks for the patch @SvenCelin. This completes the AMLS project.
   
   During the merge, however, I made a number of modifications: (1) added test assertions on expected results, (2) fixed the formatting of tests (tabs over spaces for Java code), (3) fixed the integration of ppca (the failure was due to a typo: pcaa instead of ppca), (4) modified the outputs of ppca to match the outputs of pca, (5) modified the parameters of ppca (e.g., passing of k), (6) added documentation for the parameters of lasso, and (7) fixed the formatting of builtin functions (2-space indentation).
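
For readers following item (3): the quoted hunks register the builtin as `PPCA("ppca", true)` in Builtins.java, while the script in ppca.dml defines `m_pcaa`. The sketch below is plain Python, not DML or Java, and is not SystemDS code. It assumes that a script-based builtin is resolved by prefixing its registered name with `m_` (as `m_lasso` in the quoted lasso.dml suggests); both helper names are hypothetical.

```python
def expected_function_name(builtin_name, prefix="m_"):
    """Return the script function name assumed for a registered builtin."""
    return prefix + builtin_name


def is_registered_correctly(builtin_name, defined_function):
    """Check whether the function defined in the script matches the assumed name."""
    return defined_function == expected_function_name(builtin_name)


# The names below are taken from the quoted hunks.
lasso_ok = is_registered_correctly("lasso", "m_lasso")
ppca_ok = is_registered_correctly("ppca", "m_pcaa")
```

Under that assumption, a call to `ppca` would look for `m_ppca`, so a script that only defines `m_pcaa` would not be found.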
     





[GitHub] [systemds] asfgit closed pull request #1071: Lasso and PPCA

Posted by GitBox <gi...@apache.org>.
asfgit closed pull request #1071:
URL: https://github.com/apache/systemds/pull/1071


   





[GitHub] [systemds] SvenCelin commented on a change in pull request #1071: Lasso and PPCA

Posted by GitBox <gi...@apache.org>.
SvenCelin commented on a change in pull request #1071:
URL: https://github.com/apache/systemds/pull/1071#discussion_r497407851



##########
File path: scripts/builtin/lasso.dml
##########
@@ -0,0 +1,116 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+#uses the sparsa algorithm to perform lasso regression
+
+
+m_lasso = function(Matrix[Double] X, Matrix[Double] y) return(Matrix[Double] w)
+    {

Review comment:
       Done.

##########
File path: scripts/builtin/lasso.dml
##########
@@ -0,0 +1,116 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+#uses the sparsa algorithm to perform lasso regression
+

Review comment:
       Done.







[GitHub] [systemds] SvenCelin commented on a change in pull request #1071: Lasso and PPCA

Posted by GitBox <gi...@apache.org>.
SvenCelin commented on a change in pull request #1071:
URL: https://github.com/apache/systemds/pull/1071#discussion_r497408892



##########
File path: scripts/builtin/ppca.dml
##########
@@ -0,0 +1,154 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+# This script performs Probabilistic Principal Component Analysis (PCA) on the given input data.
+# It is based on paper: sPCA: Scalable Principal Component Analysis for Big Data on Distributed
+# Platforms. Tarek Elgamal et.al.
+
+# INPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME   	TYPE   DEFAULT  MEANING
+# ---------------------------------------------------------------------------------------------
+# X  	 	String ---      location to read the matrix X input matrix
+# k      	Int    ---      indicates dimension of the new vector space constructed from eigen vectors
+# tolobj 	Int    0.00001  objective function tolerance value to stop ppca algorithm
+# tolrecerr	Int    0.02     reconstruction error tolerance value to stop the algorithm
+# iter   	Int    10       maximum number of iterations
+# fmt    	String 'text'   output format of results PPCA such as "text" or "csv"
+# hadoop jar SystemDS.jar -f PPCA.dml -nvargs X=/INPUT_DIR/X  C=/OUTPUT_DIR/C V=/OUTPUT_DIR/V k=2 tol=0.2 iter=100
+# ---------------------------------------------------------------------------------------------
+# OUTPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME   TYPE   DEFAULT  MEANING
+# ---------------------------------------------------------------------------------------------
+# C     	Matrix  ---     principal components
+# V      	Matrix  ---     eigenvalues / eigenvalues of principal components
+#
+
+
+m_pcaa = function(Matrix[Double] X, int iter = 10, double tolobj = 0.00001, double tolrecerr = 0.02, String fmt0 = "text") return(Matrix[Double] PC, Matrix[Double] V)
+    {
+    k = ncol(X)
+    n = nrow(X);
+    m = ncol(X);
+
+    #initializing principal components matrix
+    C =  rand(rows=m, cols=k, pdf="normal");
+    ss = rand(rows=1, cols=1, pdf="normal");
+    ss = as.scalar(ss);
+    ssPrev = ss;
+
+    # best selected principle components - with the lowest reconstruction error
+    PC = C;
+
+    # initilizing reconstruction error
+    RE = tolrecerr+1;
+    REBest = RE;
+
+    Z = matrix(0,rows=1,cols=1);
+
+    #Objective function value
+    ObjRelChng = tolobj+1;
+
+    # mean centered input matrix - dim -> [n,m]
+    Xm = X - colMeans(X);
+
+    #I -> k x k
+    ITMP = matrix(1,rows=k,cols=1);
+    I = diag(ITMP);
+
+    i = 0;
+    while (i < iter & ObjRelChng > tolobj & RE > tolrecerr){
+        #Estimation step - Covariance matrix
+        #M -> k x k
+        M = t(C) %*% C + I*ss;
+
+        #Auxilary matrix with n latent variables
+        # Z -> n x k
+        Z = Xm %*% (C %*% inv(M));
+
+        #ZtZ -> k x k
+        ZtZ = t(Z) %*% Z + inv(M)*ss;
+
+        #XtZ -> m x k
+        XtZ = t(Xm) %*% Z;
+
+        #Maximization step
+        #C ->  m x k
+        ZtZ_sum = sum(ZtZ); #+n*inv(M));
+        C = XtZ/ZtZ_sum;
+
+        #ss2 -> 1 x 1
+        ss2 = trace(ZtZ * (t(C) %*% C));
+
+        #ss3 -> 1 x 1
+        ss3 = sum((Z %*% t(C)) %*% t(Xm));
+
+        #Frobenius norm of reconstruction error -> Euclidean norm
+        #Fn -> 1 x 1
+        Fn = sum(Xm*Xm);
+
+        #ss -> 1 x 1
+        ss = (Fn + ss2 - 2*ss3)/(n*m);
+
+       #calculating objective function relative change
+       ObjRelChng = abs(1 - ss/ssPrev);
+       #print("Objective Relative Change: " + ObjRelChng + ", Objective: " + ss);

Review comment:
       Done.







[GitHub] [systemds] SvenCelin commented on a change in pull request #1071: Lasso and PPCA

Posted by GitBox <gi...@apache.org>.
SvenCelin commented on a change in pull request #1071:
URL: https://github.com/apache/systemds/pull/1071#discussion_r497410275



##########
File path: src/test/scripts/functions/builtin/PPCA.dml
##########
@@ -0,0 +1,25 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+X = read($1)
+PC = ppca(X = X)
+write(PC, V)
+

Review comment:
       Done.







[GitHub] [systemds] SvenCelin commented on a change in pull request #1071: Lasso and PPCA

Posted by GitBox <gi...@apache.org>.
SvenCelin commented on a change in pull request #1071:
URL: https://github.com/apache/systemds/pull/1071#discussion_r497408662



##########
File path: scripts/builtin/lasso.dml
##########
@@ -0,0 +1,116 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+#uses the sparsa algorithm to perform lasso regression
+
+
+m_lasso = function(Matrix[Double] X, Matrix[Double] y) return(Matrix[Double] w)
+    {
+    n = nrow(X)
+    m = ncol(X)
+
+
+    #params
+    tol = 10^(-15)
+    M = 5
+    tau = 1
+    maxiter = 1000
+
+    #constants

Review comment:
       They are constants; I don't think tampering with them would do much good.
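
For context on what a value like `tau` usually controls in a SpaRSA-style lasso solver: each iteration applies a soft-thresholding (shrinkage) step whose threshold scales with the regularization weight. The sketch below is plain Python, not the code from lasso.dml. It assumes `tau` plays the role of that weight, and `soft_threshold` is a hypothetical helper name.

```python
def soft_threshold(values, threshold):
    """Shrink each entry toward zero by `threshold`.

    Entries whose magnitude is at most `threshold` become exactly 0.0.
    """
    result = []
    for v in values:
        if v > threshold:
            result.append(v - threshold)
        elif v < -threshold:
            result.append(v + threshold)
        else:
            result.append(0.0)
    return result


# With a threshold of 1.0 (mirroring tau = 1 in the quoted hunk),
# the small middle coefficient is set to zero.
shrunk = soft_threshold([3.0, -0.5, -2.0], 1.0)
```

A larger threshold zeroes out more coefficients, so this value influences how sparse the fitted `w` is.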







[GitHub] [systemds] SvenCelin commented on a change in pull request #1071: Lasso and PPCA

Posted by GitBox <gi...@apache.org>.
SvenCelin commented on a change in pull request #1071:
URL: https://github.com/apache/systemds/pull/1071#discussion_r497408157



##########
File path: scripts/builtin/lasso.dml
##########
@@ -0,0 +1,116 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+#uses the sparsa algorithm to perform lasso regression

Review comment:
       It was a typo; I have changed it in the description now.







[GitHub] [systemds] SvenCelin commented on a change in pull request #1071: Lasso and PPCA

Posted by GitBox <gi...@apache.org>.
SvenCelin commented on a change in pull request #1071:
URL: https://github.com/apache/systemds/pull/1071#discussion_r497410485



##########
File path: scripts/builtin/ppca.dml
##########
@@ -0,0 +1,154 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+# This script performs Probabilistic Principal Component Analysis (PCA) on the given input data.
+# It is based on paper: sPCA: Scalable Principal Component Analysis for Big Data on Distributed
+# Platforms. Tarek Elgamal et.al.
+
+# INPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME   	TYPE   DEFAULT  MEANING
+# ---------------------------------------------------------------------------------------------
+# X  	 	String ---      location to read the matrix X input matrix
+# k      	Int    ---      indicates dimension of the new vector space constructed from eigen vectors
+# tolobj 	Int    0.00001  objective function tolerance value to stop ppca algorithm
+# tolrecerr	Int    0.02     reconstruction error tolerance value to stop the algorithm
+# iter   	Int    10       maximum number of iterations
+# fmt    	String 'text'   output format of results PPCA such as "text" or "csv"
+# hadoop jar SystemDS.jar -f PPCA.dml -nvargs X=/INPUT_DIR/X  C=/OUTPUT_DIR/C V=/OUTPUT_DIR/V k=2 tol=0.2 iter=100
+# ---------------------------------------------------------------------------------------------
+# OUTPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME   TYPE   DEFAULT  MEANING
+# ---------------------------------------------------------------------------------------------
+# C     	Matrix  ---     principal components
+# V      	Matrix  ---     eigenvalues / eigenvalues of principal components
+#
+
+
+m_pcaa = function(Matrix[Double] X, int iter = 10, double tolobj = 0.00001, double tolrecerr = 0.02, String fmt0 = "text") return(Matrix[Double] PC, Matrix[Double] V)
+    {
+    k = ncol(X)
+    n = nrow(X);
+    m = ncol(X);
+
+    #initializing principal components matrix
+    C =  rand(rows=m, cols=k, pdf="normal");
+    ss = rand(rows=1, cols=1, pdf="normal");
+    ss = as.scalar(ss);
+    ssPrev = ss;
+
+    # best selected principle components - with the lowest reconstruction error
+    PC = C;
+
+    # initilizing reconstruction error
+    RE = tolrecerr+1;
+    REBest = RE;
+
+    Z = matrix(0,rows=1,cols=1);
+
+    #Objective function value
+    ObjRelChng = tolobj+1;
+
+    # mean centered input matrix - dim -> [n,m]
+    Xm = X - colMeans(X);
+
+    #I -> k x k
+    ITMP = matrix(1,rows=k,cols=1);
+    I = diag(ITMP);
+
+    i = 0;
+    while (i < iter & ObjRelChng > tolobj & RE > tolrecerr){
+        #Estimation step - Covariance matrix
+        #M -> k x k
+        M = t(C) %*% C + I*ss;
+
+        #Auxilary matrix with n latent variables
+        # Z -> n x k
+        Z = Xm %*% (C %*% inv(M));
+
+        #ZtZ -> k x k
+        ZtZ = t(Z) %*% Z + inv(M)*ss;
+
+        #XtZ -> m x k
+        XtZ = t(Xm) %*% Z;
+
+        #Maximization step
+        #C ->  m x k
+        ZtZ_sum = sum(ZtZ); #+n*inv(M));
+        C = XtZ/ZtZ_sum;
+
+        #ss2 -> 1 x 1
+        ss2 = trace(ZtZ * (t(C) %*% C));
+
+        #ss3 -> 1 x 1
+        ss3 = sum((Z %*% t(C)) %*% t(Xm));
+
+        #Frobenius norm of reconstruction error -> Euclidean norm
+        #Fn -> 1 x 1
+        Fn = sum(Xm*Xm);
+
+        #ss -> 1 x 1
+        ss = (Fn + ss2 - 2*ss3)/(n*m);
+
+       #calculating objective function relative change
+       ObjRelChng = abs(1 - ss/ssPrev);
+       #print("Objective Relative Change: " + ObjRelChng + ", Objective: " + ss);
+
+        #Reconstruction error
+        R = ((Z %*% t(C)) -  Xm);
+
+        #calculate the error
+        #TODO rethink calculation of reconstruction error ....
+        #1-Norm of reconstruction error - a big dense matrix
+        #RE -> n x m
+        RE = abs(sum(R)/sum(Xm));
+        if (RE < REBest){
+            PC = C;
+            REBest = RE;
+        }
+        #print("ss: " + ss +" = Fn( "+ Fn +" ) + ss2( " + ss2  +" ) - 2*ss3( " + ss3 + " ), Reconstruction Error: " + RE);
+
+        ssPrev = ss;
+        i = i+1;
+    }
+    print("Objective Relative Change: " + ObjRelChng);
+    print ("Number of iterations: " + i + ", Reconstruction Err: " + REBest);
+
+    # reconstructs data
+    # RD -> n x k
+    RD = X %*% PC;
+
+    # calculate eigenvalues - principle component variance
+    RDMean = colMeans(RD);
+    V = t(colMeans(RD*RD) - (RDMean*RDMean));
+
+    # sorting eigenvalues and eigenvectors in decreasing order
+    V_decr_idx = order(target=V,by=1,decreasing=TRUE,index.return=TRUE);
+    VF_decr = table(seq(1,nrow(V)),V_decr_idx);
+    V = VF_decr %*% V;
+    PC = PC %*% VF_decr;
+
+    # writing principal components
+    # write(PC, fileC, format=fmt0);
+    # writing eigen values/pc variance
+    # write(V, fileV, format=fmt0);

Review comment:
       Done.
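
A note on the `RE = abs(sum(R)/sum(Xm))` line and its TODO in the quoted hunk: `Xm` is mean-centered, so every column sums to zero and `sum(Xm)` is zero up to rounding, which makes that ratio unstable. The sketch below is plain Python, not DML, and illustrates this together with a Frobenius-norm-based relative error as one possible alternative. The function names are hypothetical, and this is not necessarily what was merged.

```python
import math


def center_columns(rows):
    """Subtract each column's mean, like X - colMeans(X) in the quoted DML."""
    n = len(rows)
    m = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    return [[row[j] - means[j] for j in range(m)] for row in rows]


def relative_frobenius_error(approx, target):
    """Return ||approx - target||_F / ||target||_F."""
    num = sum((a - t) ** 2
              for row_a, row_t in zip(approx, target)
              for a, t in zip(row_a, row_t))
    den = sum(t ** 2 for row_t in target for t in row_t)
    return math.sqrt(num / den)


X = [[1.0, 2.0], [3.0, 6.0]]
Xm = center_columns(X)

# The denominator used in the quoted RE formula: zero for centered data.
total = sum(v for row in Xm for v in row)

# A reconstruction of all zeros has relative error 1; a perfect one has 0.
zero_fit = [[0.0, 0.0], [0.0, 0.0]]
err_zero_fit = relative_frobenius_error(zero_fit, Xm)
err_perfect = relative_frobenius_error(Xm, Xm)
```

The norm-based ratio stays well defined whenever `Xm` has at least one nonzero entry, whereas the sum-based ratio divides by a value that is zero by construction.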







[GitHub] [systemds] Baunsgaard commented on a change in pull request #1071: Lasso and PPCA

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on a change in pull request #1071:
URL: https://github.com/apache/systemds/pull/1071#discussion_r496651094



##########
File path: scripts/builtin/lasso.dml
##########
@@ -0,0 +1,116 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+#uses the sparsa algorithm to perform lasso regression
+

Review comment:
       Please add some documentation of the function. You can look at something like the l2svm for inspiration.
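   
   A possible shape for such a header, following the INPUT/OUTPUT table layout of the ppca.dml header in this PR. This is only a sketch: the one-line summary repeats the existing comment, and the parameter descriptions are assumptions that should be checked against the implementation.

   ```
   # Builtin function for lasso regression using the sparsa algorithm
   #
   # INPUT PARAMETERS:
   # ---------------------------------------------------------------
   # NAME  TYPE            DEFAULT  MEANING
   # ---------------------------------------------------------------
   # X     Matrix[Double]  ---      input feature matrix (assumed meaning)
   # y     Matrix[Double]  ---      response vector (assumed meaning)
   # ---------------------------------------------------------------
   # OUTPUT PARAMETERS:
   # ---------------------------------------------------------------
   # NAME  TYPE            DEFAULT  MEANING
   # ---------------------------------------------------------------
   # w     Matrix[Double]  ---      model weights (assumed meaning)
   # ---------------------------------------------------------------
   ```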

##########
File path: scripts/builtin/lasso.dml
##########
@@ -0,0 +1,116 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+#uses the sparsa algorithm to perform lasso regression
+
+
+m_lasso = function(Matrix[Double] X, Matrix[Double] y) return(Matrix[Double] w)

Review comment:
       Add a verbose flag to enable and disable printing.
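   
   A minimal sketch of how such a flag could be threaded through, assuming a Boolean parameter named `verbose` (the name, its default, and the printed message are illustrative, not taken from the PR). The solver body is replaced by a placeholder:

   ```
   # Sketch only: `verbose` and its default are assumptions.
   m_lasso = function(Matrix[Double] X, Matrix[Double] y, Boolean verbose = TRUE)
     return(Matrix[Double] w)
   {
     # placeholder result; the actual solver iterations would go here
     w = matrix(0, rows=ncol(X), cols=1)

     # every print is guarded by the flag, so verbose=FALSE runs silently
     if (verbose) {
       print("lasso: n=" + nrow(X) + ", m=" + ncol(X))
     }
   }
   ```

   A caller could then silence the output with `w = lasso(X=X, y=y, verbose=FALSE)`.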

##########
File path: scripts/builtin/lasso.dml
##########
@@ -0,0 +1,116 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+#uses the sparsa algorithm to perform lasso regression
+
+
+m_lasso = function(Matrix[Double] X, Matrix[Double] y) return(Matrix[Double] w)
+    {
+    n = nrow(X)
+    m = ncol(X)
+
+
+    #params
+    tol = 10^(-15)
+    M = 5
+    tau = 1
+    maxiter = 1000
+
+    #constants

Review comment:
       Also, where appropriate, expose the constants as parameters if they are intended to be modified. (Constants are usually not intended to be, but please consider each case individually here.)

##########
File path: scripts/builtin/lasso.dml
##########
@@ -0,0 +1,116 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+#uses the sparsa algorithm to perform lasso regression

Review comment:
       Just because I don't know: is the `sparsa` algorithm an established term, or a typo?

##########
File path: scripts/builtin/ppca.dml
##########
@@ -0,0 +1,154 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+# This script performs Probabilistic Principal Component Analysis (PCA) on the given input data.
+# It is based on paper: sPCA: Scalable Principal Component Analysis for Big Data on Distributed
+# Platforms. Tarek Elgamal et.al.
+
+# INPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME   	TYPE   DEFAULT  MEANING
+# ---------------------------------------------------------------------------------------------
+# X  	 	String ---      location to read the matrix X input matrix
+# k      	Int    ---      indicates dimension of the new vector space constructed from eigen vectors
+# tolobj 	Int    0.00001  objective function tolerance value to stop ppca algorithm
+# tolrecerr	Int    0.02     reconstruction error tolerance value to stop the algorithm
+# iter   	Int    10       maximum number of iterations
+# fmt    	String 'text'   output format of results PPCA such as "text" or "csv"

Review comment:
       remove fmt argument

##########
File path: scripts/builtin/lasso.dml
##########
@@ -0,0 +1,116 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+#uses the sparsa algorithm to perform lasso regression
+
+
+m_lasso = function(Matrix[Double] X, Matrix[Double] y) return(Matrix[Double] w)
+    {

Review comment:
       double check the indentation.

##########
File path: scripts/builtin/lasso.dml
##########
@@ -0,0 +1,116 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+#uses the sparsa algorithm to perform lasso regression
+
+
+m_lasso = function(Matrix[Double] X, Matrix[Double] y) return(Matrix[Double] w)
+    {
+    n = nrow(X)
+    m = ncol(X)
+
+
+    #params

Review comment:
       please add these as parameters to the algorithm
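   
   A sketch of the signature with the hard-coded values moved into defaulted parameters. The defaults mirror the values currently set under `#params` (tol = 10^(-15), M = 5, tau = 1, maxiter = 1000); the parameter types are assumptions, and the body is a placeholder:

   ```
   # Sketch only: defaults copied from the hard-coded values, types assumed.
   m_lasso = function(Matrix[Double] X, Matrix[Double] y, Double tol = 1e-15,
       Integer M = 5, Double tau = 1.0, Integer maxiter = 1000)
     return(Matrix[Double] w)
   {
     # placeholder result; the solver would read tol, M, tau and maxiter here
     w = matrix(0, rows=ncol(X), cols=1)
   }
   ```

   Existing calls such as `lasso(X=X, y=y)` would keep working, while individual values could be overridden, e.g. `lasso(X=X, y=y, maxiter=100)`.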

##########
File path: scripts/builtin/ppca.dml
##########
@@ -0,0 +1,154 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+# This script performs Probabilistic Principal Component Analysis (PCA) on the given input data.
+# It is based on paper: sPCA: Scalable Principal Component Analysis for Big Data on Distributed
+# Platforms. Tarek Elgamal et.al.
+
+# INPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME   	TYPE   DEFAULT  MEANING
+# ---------------------------------------------------------------------------------------------
+# X  	 	String ---      location to read the matrix X input matrix
+# k      	Int    ---      indicates dimension of the new vector space constructed from eigen vectors
+# tolobj 	Int    0.00001  objective function tolerance value to stop ppca algorithm
+# tolrecerr	Int    0.02     reconstruction error tolerance value to stop the algorithm
+# iter   	Int    10       maximum number of iterations
+# fmt    	String 'text'   output format of results PPCA such as "text" or "csv"
+# hadoop jar SystemDS.jar -f PPCA.dml -nvargs X=/INPUT_DIR/X  C=/OUTPUT_DIR/C V=/OUTPUT_DIR/V k=2 tol=0.2 iter=100
+# ---------------------------------------------------------------------------------------------
+# OUTPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME   TYPE   DEFAULT  MEANING
+# ---------------------------------------------------------------------------------------------
+# C     	Matrix  ---     principal components
+# V      	Matrix  ---     eigenvalues / eigenvalues of principal components
+#
+
+
+m_pcaa = function(Matrix[Double] X, int iter = 10, double tolobj = 0.00001, double tolrecerr = 0.02, String fmt0 = "text") return(Matrix[Double] PC, Matrix[Double] V)
+    {
+    k = ncol(X)
+    n = nrow(X);
+    m = ncol(X);
+
+    #initializing principal components matrix
+    C =  rand(rows=m, cols=k, pdf="normal");
+    ss = rand(rows=1, cols=1, pdf="normal");
+    ss = as.scalar(ss);
+    ssPrev = ss;
+
+    # best selected principle components - with the lowest reconstruction error
+    PC = C;
+
+    # initilizing reconstruction error
+    RE = tolrecerr+1;
+    REBest = RE;
+
+    Z = matrix(0,rows=1,cols=1);
+
+    #Objective function value
+    ObjRelChng = tolobj+1;
+
+    # mean centered input matrix - dim -> [n,m]
+    Xm = X - colMeans(X);
+
+    #I -> k x k
+    ITMP = matrix(1,rows=k,cols=1);
+    I = diag(ITMP);
+
+    i = 0;
+    while (i < iter & ObjRelChng > tolobj & RE > tolrecerr){
+        #Estimation step - Covariance matrix
+        #M -> k x k
+        M = t(C) %*% C + I*ss;
+
+        #Auxilary matrix with n latent variables
+        # Z -> n x k
+        Z = Xm %*% (C %*% inv(M));
+
+        #ZtZ -> k x k
+        ZtZ = t(Z) %*% Z + inv(M)*ss;
+
+        #XtZ -> m x k
+        XtZ = t(Xm) %*% Z;
+
+        #Maximization step
+        #C ->  m x k
+        ZtZ_sum = sum(ZtZ); #+n*inv(M));
+        C = XtZ/ZtZ_sum;
+
+        #ss2 -> 1 x 1
+        ss2 = trace(ZtZ * (t(C) %*% C));
+
+        #ss3 -> 1 x 1
+        ss3 = sum((Z %*% t(C)) %*% t(Xm));
+
+        #Frobenius norm of reconstruction error -> Euclidean norm
+        #Fn -> 1 x 1
+        Fn = sum(Xm*Xm);
+
+        #ss -> 1 x 1
+        ss = (Fn + ss2 - 2*ss3)/(n*m);
+
+       #calculating objective function relative change
+       ObjRelChng = abs(1 - ss/ssPrev);
+       #print("Objective Relative Change: " + ObjRelChng + ", Objective: " + ss);

Review comment:
       double check indentation

##########
File path: scripts/builtin/ppca.dml
##########
@@ -0,0 +1,154 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+# This script performs Probabilistic Principal Component Analysis (PCA) on the given input data.
+# It is based on paper: sPCA: Scalable Principal Component Analysis for Big Data on Distributed
+# Platforms. Tarek Elgamal et.al.
+
+# INPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME   	TYPE   DEFAULT  MEANING
+# ---------------------------------------------------------------------------------------------
+# X  	 	String ---      location to read the matrix X input matrix
+# k      	Int    ---      indicates dimension of the new vector space constructed from eigen vectors
+# tolobj 	Int    0.00001  objective function tolerance value to stop ppca algorithm
+# tolrecerr	Int    0.02     reconstruction error tolerance value to stop the algorithm
+# iter   	Int    10       maximum number of iterations
+# fmt    	String 'text'   output format of results PPCA such as "text" or "csv"
+# hadoop jar SystemDS.jar -f PPCA.dml -nvargs X=/INPUT_DIR/X  C=/OUTPUT_DIR/C V=/OUTPUT_DIR/V k=2 tol=0.2 iter=100

Review comment:
       Hadoop is no longer invoked like this, so please remove this line.

##########
File path: src/test/java/org/apache/sysds/test/functions/builtin/BuiltinLassoTest.java
##########
@@ -0,0 +1,55 @@
+package org.apache.sysds.test.functions.builtin;

Review comment:
       The license header is missing, so the tests on GitHub fail.
   
   Run `mvn package -P rat` and look at the file generated at `/target/rat.txt` to find these errors.

##########
File path: src/main/java/org/apache/sysds/common/Builtins.java
##########
@@ -152,6 +153,7 @@
 	PCA("pca", true),
 	PNMF("pnmf", true),
 	PPRED("ppred", false),
+	PPCA("ppca", true),

Review comment:
       Move this entry one line up to keep alphabetical order.

##########
File path: scripts/builtin/ppca.dml
##########
@@ -0,0 +1,154 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+# This script performs Probabilistic Principal Component Analysis (PCA) on the given input data.
+# It is based on paper: sPCA: Scalable Principal Component Analysis for Big Data on Distributed
+# Platforms. Tarek Elgamal et.al.
+
+# INPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME   	TYPE   DEFAULT  MEANING
+# ---------------------------------------------------------------------------------------------
+# X  	 	String ---      location to read the matrix X input matrix
+# k      	Int    ---      indicates dimension of the new vector space constructed from eigen vectors
+# tolobj 	Int    0.00001  objective function tolerance value to stop ppca algorithm
+# tolrecerr	Int    0.02     reconstruction error tolerance value to stop the algorithm
+# iter   	Int    10       maximum number of iterations
+# fmt    	String 'text'   output format of results PPCA such as "text" or "csv"
+# hadoop jar SystemDS.jar -f PPCA.dml -nvargs X=/INPUT_DIR/X  C=/OUTPUT_DIR/C V=/OUTPUT_DIR/V k=2 tol=0.2 iter=100
+# ---------------------------------------------------------------------------------------------
+# OUTPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME   TYPE   DEFAULT  MEANING
+# ---------------------------------------------------------------------------------------------
+# C     	Matrix  ---     principal components
+# V      	Matrix  ---     eigenvalues / eigenvalues of principal components
+#
+
+
+m_pcaa = function(Matrix[Double] X, int iter = 10, double tolobj = 0.00001, double tolrecerr = 0.02, String fmt0 = "text") return(Matrix[Double] PC, Matrix[Double] V)

Review comment:
       remove the fmt argument

##########
File path: src/test/java/org/apache/sysds/test/functions/builtin/BuiltinLassoTest.java
##########
@@ -0,0 +1,55 @@
+package org.apache.sysds.test.functions.builtin;
+
+import org.apache.sysds.test.AutomatedTestBase;
+import org.apache.sysds.test.TestConfiguration;
+import org.junit.Test;
+
+import java.util.ArrayList;
+import java.util.List;
+
+
+public class BuiltinLassoTest extends AutomatedTestBase{
+
+    private final static String TEST_NAME = "lasso";
+    private final static String TEST_DIR = "functions/builtin/";
+    private final static String TEST_CLASS_DIR = TEST_DIR + BuiltinLassoTest.class.getSimpleName() + "/";
+
+    private final static int rows = 100;
+    private final static int cols = 10;
+
+    @Override
+    public void setUp(){
+        addTestConfiguration(TEST_NAME, new TestConfiguration(TEST_CLASS_DIR, TEST_NAME, new String[]{"B"}));
+    }
+
+    @Test
+    public void testLasso(){ runLassoTest(); }
+
+
+    private void runLassoTest(){
+
+        loadTestConfiguration(getTestConfiguration(TEST_NAME));
+        String HOME = SCRIPT_DIR + TEST_DIR;
+        fullDMLScriptName = HOME + TEST_NAME + ".dml";
+        List<String> proArgs = new ArrayList<>();
+
+
+        proArgs.add("-explain");
+        proArgs.add("-stats");
+        proArgs.add("-args");
+        proArgs.add(input("X"));
+        proArgs.add(input("y"));
+        proArgs.add(output("w"));
+        programArgs = proArgs.toArray(new String[proArgs.size()]);
+        double[][] X = getRandomMatrix(rows, cols, 0, 1, 0.8, -1);
+        double[][] y = getRandomMatrix(rows, 1, 0, 1, 0.8, -1);
+        writeInputMatrixWithMTD("X", X, true);
+        writeInputMatrixWithMTD("y", y, true);
+
+
+        runTest(true, EXCEPTION_NOT_EXPECTED, null, -1);
+

Review comment:
       Please add logic to verify that the algorithm runs correctly.
   The current code only executes the algorithm to see whether it crashes. We need the test to also verify that the algorithm outputs something reasonable.
   
   You can engineer your input, based on the algorithm, so that the output is something that makes sense.
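   
   One way to engineer such an input, sketched here in DML rather than in the Java test: build `y` from a known sparse weight vector so that the expected output is known up front. The sizes match the test's `rows` and `cols`; the seed and weight values are arbitrary assumptions. How closely `w` matches `w_true` depends on the regularization, so the sketch only reports the deviation instead of asserting a threshold:

   ```
   # Sketch only: seed and weights are illustrative assumptions.
   X = rand(rows=100, cols=10, min=0, max=1, seed=42)

   # known sparse ground truth: two non-zero weights, the rest zero
   w_true = matrix(0, rows=10, cols=1)
   w_true[1,1] = 2.0
   w_true[5,1] = -1.0

   # noise-free response, so a correct solver should land near w_true
   y = X %*% w_true

   w = lasso(X=X, y=y)
   print("max abs deviation from w_true: " + max(abs(w - w_true)))
   ```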

##########
File path: scripts/builtin/lasso.dml
##########
@@ -0,0 +1,116 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+#uses the sparsa algorithm to perform lasso regression

Review comment:
       If it is a term, I would need a link to some documentation, or more information, to be able to search for it efficiently.

##########
File path: src/test/scripts/functions/builtin/lasso.dml
##########
@@ -0,0 +1,25 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+X = read($1)
+y = read($2)
+w = lasso(X = X, y = y)
+write(w, $3)

Review comment:
       Add a newline to make GitHub happy.

##########
File path: scripts/builtin/ppca.dml
##########
@@ -0,0 +1,154 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+# This script performs Probabilistic Principal Component Analysis (PCA) on the given input data.
+# It is based on paper: sPCA: Scalable Principal Component Analysis for Big Data on Distributed
+# Platforms. Tarek Elgamal et.al.
+
+# INPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME   	TYPE   DEFAULT  MEANING
+# ---------------------------------------------------------------------------------------------
+# X  	 	String ---      location to read the matrix X input matrix
+# k      	Int    ---      indicates dimension of the new vector space constructed from eigen vectors
+# tolobj 	Int    0.00001  objective function tolerance value to stop ppca algorithm
+# tolrecerr	Int    0.02     reconstruction error tolerance value to stop the algorithm
+# iter   	Int    10       maximum number of iterations
+# fmt    	String 'text'   output format of results PPCA such as "text" or "csv"
+# hadoop jar SystemDS.jar -f PPCA.dml -nvargs X=/INPUT_DIR/X  C=/OUTPUT_DIR/C V=/OUTPUT_DIR/V k=2 tol=0.2 iter=100
+# ---------------------------------------------------------------------------------------------
+# OUTPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME   TYPE   DEFAULT  MEANING
+# ---------------------------------------------------------------------------------------------
+# C     	Matrix  ---     principal components
+# V      	Matrix  ---     eigenvalues / eigenvalues of principal components
+#
+
+
+m_pcaa = function(Matrix[Double] X, int iter = 10, double tolobj = 0.00001, double tolrecerr = 0.02, String fmt0 = "text") return(Matrix[Double] PC, Matrix[Double] V)
+    {
+    k = ncol(X)
+    n = nrow(X);
+    m = ncol(X);
+
+    #initializing principal components matrix
+    C =  rand(rows=m, cols=k, pdf="normal");
+    ss = rand(rows=1, cols=1, pdf="normal");
+    ss = as.scalar(ss);
+    ssPrev = ss;
+
+    # best selected principle components - with the lowest reconstruction error
+    PC = C;
+
+    # initilizing reconstruction error
+    RE = tolrecerr+1;
+    REBest = RE;
+
+    Z = matrix(0,rows=1,cols=1);
+
+    #Objective function value
+    ObjRelChng = tolobj+1;
+
+    # mean centered input matrix - dim -> [n,m]
+    Xm = X - colMeans(X);
+
+    #I -> k x k
+    ITMP = matrix(1,rows=k,cols=1);
+    I = diag(ITMP);
+
+    i = 0;
+    while (i < iter & ObjRelChng > tolobj & RE > tolrecerr){
+        #Estimation step - Covariance matrix
+        #M -> k x k
+        M = t(C) %*% C + I*ss;
+
+        #Auxilary matrix with n latent variables
+        # Z -> n x k
+        Z = Xm %*% (C %*% inv(M));
+
+        #ZtZ -> k x k
+        ZtZ = t(Z) %*% Z + inv(M)*ss;
+
+        #XtZ -> m x k
+        XtZ = t(Xm) %*% Z;
+
+        #Maximization step
+        #C ->  m x k
+        ZtZ_sum = sum(ZtZ); #+n*inv(M));
+        C = XtZ/ZtZ_sum;
+
+        #ss2 -> 1 x 1
+        ss2 = trace(ZtZ * (t(C) %*% C));
+
+        #ss3 -> 1 x 1
+        ss3 = sum((Z %*% t(C)) %*% t(Xm));
+
+        #Frobenius norm of reconstruction error -> Euclidean norm
+        #Fn -> 1 x 1
+        Fn = sum(Xm*Xm);
+
+        #ss -> 1 x 1
+        ss = (Fn + ss2 - 2*ss3)/(n*m);
+
+       #calculating objective function relative change
+       ObjRelChng = abs(1 - ss/ssPrev);
+       #print("Objective Relative Change: " + ObjRelChng + ", Objective: " + ss);
+
+        #Reconstruction error
+        R = ((Z %*% t(C)) -  Xm);
+
+        #calculate the error
+        #TODO rethink calculation of reconstruction error ....
+        #1-Norm of reconstruction error - a big dense matrix
+        #RE -> n x m
+        RE = abs(sum(R)/sum(Xm));
+        if (RE < REBest){
+            PC = C;
+            REBest = RE;
+        }
+        #print("ss: " + ss +" = Fn( "+ Fn +" ) + ss2( " + ss2  +" ) - 2*ss3( " + ss3 + " ), Reconstruction Error: " + RE);
+
+        ssPrev = ss;
+        i = i+1;
+    }
+    print("Objective Relative Change: " + ObjRelChng);
+    print ("Number of iterations: " + i + ", Reconstruction Err: " + REBest);
+
+    # reconstructs data
+    # RD -> n x k
+    RD = X %*% PC;
+
+    # calculate eigenvalues - principle component variance
+    RDMean = colMeans(RD);
+    V = t(colMeans(RD*RD) - (RDMean*RDMean));
+
+    # sorting eigenvalues and eigenvectors in decreasing order
+    V_decr_idx = order(target=V,by=1,decreasing=TRUE,index.return=TRUE);
+    VF_decr = table(seq(1,nrow(V)),V_decr_idx);
+    V = VF_decr %*% V;
+    PC = PC %*% VF_decr;
+
+    # writing principal components
+    # write(PC, fileC, format=fmt0);
+    # writing eigen values/pc variance
+    # write(V, fileV, format=fmt0);
+    }

Review comment:
       add a newline to make GitHub happy.

##########
File path: scripts/builtin/ppca.dml
##########
@@ -0,0 +1,154 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+# This script performs Probabilistic Principal Component Analysis (PCA) on the given input data.
+# It is based on paper: sPCA: Scalable Principal Component Analysis for Big Data on Distributed
+# Platforms. Tarek Elgamal et.al.
+
+# INPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME   	TYPE   DEFAULT  MEANING
+# ---------------------------------------------------------------------------------------------
+# X  	 	String ---      location to read the matrix X input matrix
+# k      	Int    ---      indicates dimension of the new vector space constructed from eigen vectors
+# tolobj 	Int    0.00001  objective function tolerance value to stop ppca algorithm
+# tolrecerr	Int    0.02     reconstruction error tolerance value to stop the algorithm
+# iter   	Int    10       maximum number of iterations
+# fmt    	String 'text'   output format of results PPCA such as "text" or "csv"
+# hadoop jar SystemDS.jar -f PPCA.dml -nvargs X=/INPUT_DIR/X  C=/OUTPUT_DIR/C V=/OUTPUT_DIR/V k=2 tol=0.2 iter=100
+# ---------------------------------------------------------------------------------------------
+# OUTPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME   TYPE   DEFAULT  MEANING
+# ---------------------------------------------------------------------------------------------
+# C     	Matrix  ---     principal components
+# V      	Matrix  ---     eigenvalues / eigenvalues of principal components
+#
+
+
+m_pcaa = function(Matrix[Double] X, int iter = 10, double tolobj = 0.00001, double tolrecerr = 0.02, String fmt0 = "text") return(Matrix[Double] PC, Matrix[Double] V)
+    {
+    k = ncol(X)
+    n = nrow(X);
+    m = ncol(X);
+
+    #initializing principal components matrix
+    C =  rand(rows=m, cols=k, pdf="normal");
+    ss = rand(rows=1, cols=1, pdf="normal");
+    ss = as.scalar(ss);
+    ssPrev = ss;
+
+    # best selected principle components - with the lowest reconstruction error
+    PC = C;
+
+    # initilizing reconstruction error
+    RE = tolrecerr+1;
+    REBest = RE;
+
+    Z = matrix(0,rows=1,cols=1);
+
+    #Objective function value
+    ObjRelChng = tolobj+1;
+
+    # mean centered input matrix - dim -> [n,m]
+    Xm = X - colMeans(X);
+
+    #I -> k x k
+    ITMP = matrix(1,rows=k,cols=1);
+    I = diag(ITMP);
+
+    i = 0;
+    while (i < iter & ObjRelChng > tolobj & RE > tolrecerr){
+        #Estimation step - Covariance matrix
+        #M -> k x k
+        M = t(C) %*% C + I*ss;
+
+        #Auxilary matrix with n latent variables
+        # Z -> n x k
+        Z = Xm %*% (C %*% inv(M));
+
+        #ZtZ -> k x k
+        ZtZ = t(Z) %*% Z + inv(M)*ss;
+
+        #XtZ -> m x k
+        XtZ = t(Xm) %*% Z;
+
+        #Maximization step
+        #C ->  m x k
+        ZtZ_sum = sum(ZtZ); #+n*inv(M));
+        C = XtZ/ZtZ_sum;
+
+        #ss2 -> 1 x 1
+        ss2 = trace(ZtZ * (t(C) %*% C));
+
+        #ss3 -> 1 x 1
+        ss3 = sum((Z %*% t(C)) %*% t(Xm));
+
+        #Frobenius norm of reconstruction error -> Euclidean norm
+        #Fn -> 1 x 1
+        Fn = sum(Xm*Xm);
+
+        #ss -> 1 x 1
+        ss = (Fn + ss2 - 2*ss3)/(n*m);
+
+       #calculating objective function relative change
+       ObjRelChng = abs(1 - ss/ssPrev);
+       #print("Objective Relative Change: " + ObjRelChng + ", Objective: " + ss);
+
+        #Reconstruction error
+        R = ((Z %*% t(C)) -  Xm);
+
+        #calculate the error
+        #TODO rethink calculation of reconstruction error ....
+        #1-Norm of reconstruction error - a big dense matrix
+        #RE -> n x m
+        RE = abs(sum(R)/sum(Xm));
+        if (RE < REBest){
+            PC = C;
+            REBest = RE;
+        }
+        #print("ss: " + ss +" = Fn( "+ Fn +" ) + ss2( " + ss2  +" ) - 2*ss3( " + ss3 + " ), Reconstruction Error: " + RE);
+
+        ssPrev = ss;
+        i = i+1;
+    }
+    print("Objective Relative Change: " + ObjRelChng);
+    print ("Number of iterations: " + i + ", Reconstruction Err: " + REBest);
+
+    # reconstructs data
+    # RD -> n x k
+    RD = X %*% PC;
+
+    # calculate eigenvalues - principle component variance
+    RDMean = colMeans(RD);
+    V = t(colMeans(RD*RD) - (RDMean*RDMean));
+
+    # sorting eigenvalues and eigenvectors in decreasing order
+    V_decr_idx = order(target=V,by=1,decreasing=TRUE,index.return=TRUE);
+    VF_decr = table(seq(1,nrow(V)),V_decr_idx);
+    V = VF_decr %*% V;
+    PC = PC %*% VF_decr;
+
+    # writing principal components
+    # write(PC, fileC, format=fmt0);
+    # writing eigen values/pc variance
+    # write(V, fileV, format=fmt0);

Review comment:
       Remove these commented-out write statements.

##########
File path: src/test/scripts/functions/builtin/PPCA.dml
##########
@@ -0,0 +1,25 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+X = read($1)
+PC = ppca(X = X)
+write(PC, V)
+

Review comment:
       PPCA outputs two variables.
   Currently, when you try to write this to an output, we unfortunately do not throw an error; the write statement is simply ignored.
   Therefore this code will run fine but ultimately be buggy.
   Please bind both results from PPCA and write both out.
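
On the test side, a silently ignored write can be caught by asserting that each expected output file exists after the run. The following is a minimal Java sketch of that idea, assuming the outputs land as plain files named PC and V in one directory. `OutputCheckSketch`, `missingOutputs`, and `demo` are hypothetical names used for illustration and are not part of the SystemDS test framework.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the SystemDS test framework.
class OutputCheckSketch {

    // Returns the names from `expected` that have no file in `outDir`.
    static List<String> missingOutputs(Path outDir, String... expected) {
        List<String> missing = new ArrayList<>();
        for (String name : expected) {
            if (!Files.exists(outDir.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    // Simulates a run that wrote only PC, then reports what is missing.
    static List<String> demo() {
        try {
            Path dir = Files.createTempDirectory("ppca-out");
            Files.createFile(dir.resolve("PC"));
            return missingOutputs(dir, "PC", "V");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("missing outputs: " + demo());
    }
}
```

A test could fail whenever the returned list is non-empty, which would turn the ignored write into a visible failure rather than a passing run.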

##########
File path: src/test/java/org/apache/sysds/test/functions/builtin/BuiltinPPCATest.java
##########
@@ -0,0 +1,49 @@
+package org.apache.sysds.test.functions.builtin;

Review comment:
       Add the license header.

##########
File path: src/test/java/org/apache/sysds/test/functions/builtin/BuiltinPPCATest.java
##########
@@ -0,0 +1,49 @@
+package org.apache.sysds.test.functions.builtin;
+
+import org.apache.sysds.test.AutomatedTestBase;
+import org.apache.sysds.test.TestConfiguration;
+import org.junit.Test;
+
+import java.util.ArrayList;
+import java.util.List;
+
+public class BuiltinPPCATest extends AutomatedTestBase {
+    private final static String TEST_NAME = "PPCA";
+    private final static String TEST_DIR = "functions/builtin/";
+    private final static String TEST_CLASS_DIR = TEST_DIR + BuiltinPPCATest.class.getSimpleName() + "/";
+
+    private final static int rows = 100;
+    private final static int cols = 10;
+
+    @Override
+    public void setUp(){
+        addTestConfiguration(TEST_NAME, new TestConfiguration(TEST_CLASS_DIR, TEST_NAME, new String[]{"B"}));
+    }
+
+    @Test
+    public void testPPCA(){ runPPCATest(); }
+
+
+    private void runPPCATest(){
+
+        loadTestConfiguration(getTestConfiguration(TEST_NAME));
+        String HOME = SCRIPT_DIR + TEST_DIR;
+        fullDMLScriptName = HOME + TEST_NAME + ".dml";
+        List<String> proArgs = new ArrayList<>();
+
+        proArgs.add("-explain");
+        proArgs.add("-stats");
+        proArgs.add("-args");
+        proArgs.add(input("X"));
+        proArgs.add(output("PC"));
+        proArgs.add(output("V"));
+        programArgs = proArgs.toArray(new String[proArgs.size()]);
+        double[][] X = getRandomMatrix(rows, cols, 0, 1, 0.8, -1);
+        writeInputMatrixWithMTD("X", X, true);
+
+
+        runTest(true, EXCEPTION_NOT_EXPECTED, null, -1);
+

Review comment:
       Same points as the other test.

##########
File path: src/test/java/org/apache/sysds/test/functions/builtin/BuiltinLassoTest.java
##########
@@ -0,0 +1,55 @@
+package org.apache.sysds.test.functions.builtin;
+
+import org.apache.sysds.test.AutomatedTestBase;
+import org.apache.sysds.test.TestConfiguration;
+import org.junit.Test;
+
+import java.util.ArrayList;
+import java.util.List;
+
+
+public class BuiltinLassoTest extends AutomatedTestBase{
+
+    private final static String TEST_NAME = "lasso";
+    private final static String TEST_DIR = "functions/builtin/";
+    private final static String TEST_CLASS_DIR = TEST_DIR + BuiltinLassoTest.class.getSimpleName() + "/";
+
+    private final static int rows = 100;
+    private final static int cols = 10;
+
+    @Override
+    public void setUp(){
+        addTestConfiguration(TEST_NAME, new TestConfiguration(TEST_CLASS_DIR, TEST_NAME, new String[]{"B"}));
+    }
+
+    @Test
+    public void testLasso(){ runLassoTest(); }
+
+
+    private void runLassoTest(){
+
+        loadTestConfiguration(getTestConfiguration(TEST_NAME));
+        String HOME = SCRIPT_DIR + TEST_DIR;
+        fullDMLScriptName = HOME + TEST_NAME + ".dml";
+        List<String> proArgs = new ArrayList<>();
+
+
+        proArgs.add("-explain");
+        proArgs.add("-stats");

Review comment:
       Please remove -explain and -stats once you are done checking the algorithm and upgrading the test.
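
One way to keep the diagnostic flags available while debugging without committing them is to gate them behind a switch. The following is a minimal Java sketch under that assumption. `ProgramArgsSketch` and `buildArgs` are hypothetical names, and the paths are placeholders for what the real test derives via input(...) and output(...).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of SystemDS: assembles the command-line
// arguments a test passes to a DML script.
class ProgramArgsSketch {

    // Diagnostic flags are added only when debug is true, so the
    // committed test stays quiet by default.
    static List<String> buildArgs(boolean debug, String... positional) {
        List<String> args = new ArrayList<>();
        if (debug) {
            args.add("-explain");
            args.add("-stats");
        }
        args.add("-args");
        for (String p : positional) {
            args.add(p);
        }
        return args;
    }

    public static void main(String[] unused) {
        // Placeholder paths standing in for input("X"), output("PC"), output("V").
        System.out.println(buildArgs(false, "in/X", "out/PC", "out/V"));
    }
}
```

With debug set to false the list contains only -args and the positional paths, which matches what the review asks for in the committed test.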

##########
File path: scripts/builtin/ppca.dml
##########
@@ -0,0 +1,154 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+# This script performs Probabilistic Principal Component Analysis (PCA) on the given input data.
+# It is based on paper: sPCA: Scalable Principal Component Analysis for Big Data on Distributed
+# Platforms. Tarek Elgamal et.al.
+
+# INPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME   	TYPE   DEFAULT  MEANING
+# ---------------------------------------------------------------------------------------------
+# X  	 	String ---      location to read the matrix X input matrix
+# k      	Int    ---      indicates dimension of the new vector space constructed from eigen vectors
+# tolobj 	Int    0.00001  objective function tolerance value to stop ppca algorithm
+# tolrecerr	Int    0.02     reconstruction error tolerance value to stop the algorithm
+# iter   	Int    10       maximum number of iterations
+# fmt    	String 'text'   output format of results PPCA such as "text" or "csv"
+# hadoop jar SystemDS.jar -f PPCA.dml -nvargs X=/INPUT_DIR/X  C=/OUTPUT_DIR/C V=/OUTPUT_DIR/V k=2 tol=0.2 iter=100
+# ---------------------------------------------------------------------------------------------
+# OUTPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME   TYPE   DEFAULT  MEANING
+# ---------------------------------------------------------------------------------------------
+# C     	Matrix  ---     principal components
+# V      	Matrix  ---     eigenvalues / eigenvalues of principal components
+#
+
+
+m_pcaa = function(Matrix[Double] X, int iter = 10, double tolobj = 0.00001, double tolrecerr = 0.02, String fmt0 = "text") return(Matrix[Double] PC, Matrix[Double] V)

Review comment:
       Move the return clause to a new line.







[GitHub] [systemds] SvenCelin commented on a change in pull request #1071: Lasso and PPCA

Posted by GitBox <gi...@apache.org>.
SvenCelin commented on a change in pull request #1071:
URL: https://github.com/apache/systemds/pull/1071#discussion_r497410190



##########
File path: src/test/java/org/apache/sysds/test/functions/builtin/BuiltinLassoTest.java
##########
@@ -0,0 +1,55 @@
+package org.apache.sysds.test.functions.builtin;
+
+import org.apache.sysds.test.AutomatedTestBase;
+import org.apache.sysds.test.TestConfiguration;
+import org.junit.Test;
+
+import java.util.ArrayList;
+import java.util.List;
+
+
+public class BuiltinLassoTest extends AutomatedTestBase{
+
+    private final static String TEST_NAME = "lasso";
+    private final static String TEST_DIR = "functions/builtin/";
+    private final static String TEST_CLASS_DIR = TEST_DIR + BuiltinLassoTest.class.getSimpleName() + "/";
+
+    private final static int rows = 100;
+    private final static int cols = 10;
+
+    @Override
+    public void setUp(){
+        addTestConfiguration(TEST_NAME, new TestConfiguration(TEST_CLASS_DIR, TEST_NAME, new String[]{"B"}));
+    }
+
+    @Test
+    public void testLasso(){ runLassoTest(); }
+
+
+    private void runLassoTest(){
+
+        loadTestConfiguration(getTestConfiguration(TEST_NAME));
+        String HOME = SCRIPT_DIR + TEST_DIR;
+        fullDMLScriptName = HOME + TEST_NAME + ".dml";
+        List<String> proArgs = new ArrayList<>();
+
+
+        proArgs.add("-explain");
+        proArgs.add("-stats");

Review comment:
       Done.







[GitHub] [systemds] SvenCelin commented on a change in pull request #1071: Lasso and PPCA

Posted by GitBox <gi...@apache.org>.
SvenCelin commented on a change in pull request #1071:
URL: https://github.com/apache/systemds/pull/1071#discussion_r497409273



##########
File path: scripts/builtin/ppca.dml
##########
@@ -0,0 +1,154 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+# This script performs Probabilistic Principal Component Analysis (PCA) on the given input data.
+# It is based on paper: sPCA: Scalable Principal Component Analysis for Big Data on Distributed
+# Platforms. Tarek Elgamal et.al.
+
+# INPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME   	TYPE   DEFAULT  MEANING
+# ---------------------------------------------------------------------------------------------
+# X  	 	String ---      location to read the matrix X input matrix
+# k      	Int    ---      indicates dimension of the new vector space constructed from eigen vectors
+# tolobj 	Int    0.00001  objective function tolerance value to stop ppca algorithm
+# tolrecerr	Int    0.02     reconstruction error tolerance value to stop the algorithm
+# iter   	Int    10       maximum number of iterations
+# fmt    	String 'text'   output format of results PPCA such as "text" or "csv"
+# hadoop jar SystemDS.jar -f PPCA.dml -nvargs X=/INPUT_DIR/X  C=/OUTPUT_DIR/C V=/OUTPUT_DIR/V k=2 tol=0.2 iter=100
+# ---------------------------------------------------------------------------------------------
+# OUTPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME   TYPE   DEFAULT  MEANING
+# ---------------------------------------------------------------------------------------------
+# C     	Matrix  ---     principal components
+# V      	Matrix  ---     eigenvalues / eigenvalues of principal components
+#
+
+
+m_pcaa = function(Matrix[Double] X, int iter = 10, double tolobj = 0.00001, double tolrecerr = 0.02, String fmt0 = "text") return(Matrix[Double] PC, Matrix[Double] V)

Review comment:
       Done.







[GitHub] [systemds] SvenCelin commented on a change in pull request #1071: Lasso and PPCA

Posted by GitBox <gi...@apache.org>.
SvenCelin commented on a change in pull request #1071:
URL: https://github.com/apache/systemds/pull/1071#discussion_r497412061



##########
File path: src/test/java/org/apache/sysds/test/functions/builtin/BuiltinLassoTest.java
##########
@@ -0,0 +1,55 @@
+package org.apache.sysds.test.functions.builtin;

Review comment:
       Done.



