Posted to dev@systemml.apache.org by GitBox <gi...@apache.org> on 2020/06/09 18:38:27 UTC
[GitHub] [systemml] Shafaq-Siddiqi opened a new pull request #972: [MINOR] MICE now accepts a recoded matrix as input
Shafaq-Siddiqi opened a new pull request #972:
URL: https://github.com/apache/systemml/pull/972
MICE now accepts a matrix as input instead of a frame. Categorical features can be recoded before calling MICE.
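As a rough illustration of what "recoded first" means here, the following Python sketch (not DML, and not the SystemDS implementation) maps each categorical column to integer codes and leaves missing cells as NaN, producing the kind of numeric matrix the new MICE signature expects. The helper names `recode_column` and `recode_frame` are hypothetical, and assigning codes in order of first appearance is an assumption of this sketch; the real recode transformation may order codes differently.

```python
import math

def recode_column(values):
    """Map each distinct non-missing string to a 1-based integer code.

    Missing cells ("" or None) become NaN. Codes are assigned in order of
    first appearance, which is an assumption made for this sketch.
    """
    codes = {}
    out = []
    for v in values:
        if v is None or v == "":
            out.append(math.nan)
        else:
            if v not in codes:
                codes[v] = len(codes) + 1
            out.append(float(codes[v]))
    return out, codes

def recode_frame(rows, c_mask):
    """Turn a frame of strings into a numeric matrix.

    rows   : list of rows, each a list of strings
    c_mask : 0/1 per column (1 = categorical, 0 = numeric), as in cMask
    """
    columns = []
    for j in range(len(c_mask)):
        col = [row[j] for row in rows]
        if c_mask[j] == 1:
            coded, _ = recode_column(col)
        else:
            # numeric column: parse the value, keep missing cells as NaN
            coded = [math.nan if v in ("", None) else float(v) for v in col]
        columns.append(coded)
    # transpose the list of columns back into a list of rows
    return [list(r) for r in zip(*columns)]

frame = [["42.0", "a"], ["", "b"], ["51.0", ""], ["59.0", "a"]]
c_mask = [0, 1]
X = recode_frame(frame, c_mask)
```

The resulting `X` and the unchanged `c_mask` are what a caller would then pass to MICE in place of the original frame.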
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [systemml] Shafaq-Siddiqi commented on a change in pull request #972: [MINOR] MICE now accepts a recoded matrix as input
Posted by GitBox <gi...@apache.org>.
Shafaq-Siddiqi commented on a change in pull request #972:
URL: https://github.com/apache/systemml/pull/972#discussion_r437642060
##########
File path: src/test/java/org/apache/sysds/test/functions/builtin/BuiltinMiceTest.java
##########
@@ -50,33 +50,47 @@ public void testMiceMixCP() {
runMiceNominalTest(mask, 1, LopProperties.ExecType.CP);
}
+// @Test
Review comment:
@mboehm7 as discussed, the Spark tests are commented out due to a "NullPointerException" in one-hot encoding. The error can be reproduced by running the Spark tests.
[GitHub] [systemml] phaniarnab commented on a change in pull request #972: [MINOR] MICE now accepts a recoded matrix as input
Posted by GitBox <gi...@apache.org>.
phaniarnab commented on a change in pull request #972:
URL: https://github.com/apache/systemml/pull/972#discussion_r439621028
##########
File path: src/test/java/org/apache/sysds/test/functions/builtin/BuiltinMiceTest.java
##########
@@ -36,7 +36,7 @@
private static final String TEST_CLASS_DIR = TEST_DIR + BuiltinMiceTest.class.getSimpleName() + "/";
private final static String DATASET = SCRIPT_DIR +"functions/transform/input/ChickWeight.csv";
- private final static double eps = 0.2;
+ private final static double eps = 0.16;
private final static int iter = 3;
private final static int com = 2;
Review comment:
Please remove the unused variable com.
[GitHub] [systemml] Shafaq-Siddiqi commented on pull request #972: [MINOR] MICE now accepts a recoded matrix as input
Posted by GitBox <gi...@apache.org>.
Shafaq-Siddiqi commented on pull request #972:
URL: https://github.com/apache/systemml/pull/972#issuecomment-641962097
> Can this file https://github.com/apache/systemml/blob/master/dev/docs/builtins-reference.md#mice-function also be updated?
>
> Thank you.
Yes, I will update this once the PR is merged.
Thank you.
[GitHub] [systemml] Shafaq-Siddiqi commented on a change in pull request #972: [MINOR] MICE now accepts a recoded matrix as input
Posted by GitBox <gi...@apache.org>.
Shafaq-Siddiqi commented on a change in pull request #972:
URL: https://github.com/apache/systemml/pull/972#discussion_r438096170
##########
File path: src/test/java/org/apache/sysds/test/functions/builtin/BuiltinMiceTest.java
##########
@@ -50,33 +50,47 @@ public void testMiceMixCP() {
runMiceNominalTest(mask, 1, LopProperties.ExecType.CP);
}
+// @Test
Review comment:
The codegen tests are failing in this PR's build while passing locally. The trace is as follows; any help or hint is appreciated.
[ERROR] Crashed tests:
[ERROR] org.apache.sysds.test.functions.codegen.RowAggTmplTest
[ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork(ForkStarter.java:690)
[ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.access$600(ForkStarter.java:118)
[ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter$2.call(ForkStarter.java:447)
[ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter$2.call(ForkStarter.java:423)
[ERROR] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[ERROR] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[ERROR] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[GitHub] [systemml] phaniarnab commented on a change in pull request #972: [MINOR] MICE now accepts a recoded matrix as input
Posted by GitBox <gi...@apache.org>.
phaniarnab commented on a change in pull request #972:
URL: https://github.com/apache/systemml/pull/972#discussion_r439627882
##########
File path: scripts/builtin/mice.dml
##########
@@ -19,266 +19,176 @@
#
#-------------------------------------------------------------
-# Builtin function Implements Multiple Imputation using Chained Equations (MICE) for nominal data
+# Built-in function Implements Multiple Imputation using Chained Equations (MICE)
#
# INPUT PARAMETERS:
# ---------------------------------------------------------------------------------------------
# NAME TYPE DEFAULT MEANING
# ---------------------------------------------------------------------------------------------
-# F String --- Data Frame
-# cMask Double --- A 0/1 row vector for identifying numeric (0) adn categorical features (1)
+# X String --- Data Matrix (Recoded Matrix for categorical features)
+# cMask Double --- A 0/1 row vector for identifying numeric (0) and categorical features (1)
# iter Integer 3 Number of iteration for multiple imputations
-# complete Integer 3 A complete dataset generated though a specific iteration
# ---------------------------------------------------------------------------------------------
#Output(s)
# ---------------------------------------------------------------------------------------------
# NAME TYPE DEFAULT MEANING
# ---------------------------------------------------------------------------------------------
-# dataset Double --- imputed dataset
-# singleSet Double --- A complete dataset generated though a specific iteration
-
-# Assumption missing value are represented with empty string i.e ",," in csv file
-# variables with suffix n are storing continuous/numeric data and variables with suffix c are storing categorical data
-s_mice= function(Frame[String] F, Matrix[Double] cMask, Integer iter = 3, Integer complete = 3, Boolean verbose = FALSE)
-return(Frame[String] dataset, Frame[String] singleSet)
-{
-
- if(ncol(F) == 1)
- stop("invalid argument: can not apply mice on single column")
-
- if(complete > iter)
- complete = iter
+# output Double --- imputed dataset
- # adding a temporary feature (in-case all attributes are of same type)
- F = cbind(F, as.frame(matrix(1,nrow(F), 1)))
- cMask = cbind(cMask, matrix(1,1,1))
+# Assumption missing value are represented with empty string i.e ",," in CSV file
+# variables with suffix n are storing continuos/numeric data and variables with suffix c are storing categorical data
+m_mice= function(Matrix[Double] X, Matrix[Double] cMask, Integer iter = 3, Boolean verbose = FALSE)
+return(Matrix[Double] output)
+{
- n = nrow(F)
- row = n*complete;
- col = ncol(F)
- Result = matrix(0, rows=1, cols = col)
- Mask_Result = matrix(0, rows=1, cols=col)
- scat = seq(1, ncol(cMask))
- cat = removeEmpty(target=scat, margin="rows", select=t(cMask))
+ lastIndex = ncol(X)
+ if(sum(cMask) == 0) # if all features are numeric add a categorical features
+ {
+ X = cbind(X, matrix(1, nrow(X), 1))
+ cMask = cbind(cMask, matrix(1, 1, 1))
+ }
+ else if(sum(cMask) == ncol(cMask)) # if all features are categorical add a numeric features
+ {
+ X = cbind(X, matrix(1, nrow(X), 1))
+ cMask = cbind(cMask, matrix(0, 1, 1))
+ }
- if(nrow(cat) == ncol(F))
- cMask[1,ncol(cMask)] = 0
-
- s=""
- for(i in 1: nrow(cat), check =0)
- s = s+as.integer(as.scalar(cat[i, 1]))+",";
-
+ # separate categorical and continuous features
+ nX = removeEmpty(target=X, margin="cols", select=(cMask==0))
+ cX = removeEmpty(target=X, margin="cols", select= cMask)
- # encoding categorical columns using recode transformation
- jspecR = "{ids:true, recode:["+s+"]}";
- [X, M] = transformencode(target=F, spec=jspecR);
-
- XO = replace(target=X, pattern=NaN, replacement=0);
-
- # remove categorical features and impute continuous features with mean
- eX_n = removeEmpty(target=X, margin="cols", select=(cMask==0))
- col_n = ncol(eX_n);
- # storing the mask/address of missing values
- Mask_n = is.na(eX_n);
- inverseMask_n = 1 - Mask_n;
- # replacing the empty cells in encoded data with 0
- eX_n = replace(target=eX_n, pattern=NaN, replacement=0);
- # filling the missing data with their means
- X2_n = eX_n+(Mask_n*colMeans(eX_n))
- # matrices for computing actul data
- p_n = table(seq(1, ncol(eX_n)), removeEmpty(target=scat, margin="rows", select=t(cMask==0)))
- if(ncol(p_n) < ncol(cMask))
- p_n = cbind(p_n, matrix(0, nrow(p_n), ncol(cMask)-ncol(p_n)))
- q = XO * cMask
+ # store the mask of numeric missing values
+ Mask_n = is.na(nX);
+ nX = replace(target=nX, pattern=NaN, replacement=0);
+ # initial mean imputation
+ X_n = nX+(Mask_n*colMeans(nX))
+
+ # store the mask of categorical missing values
+ Mask_c = is.na(cX);
+ cX = replace(target=cX, pattern=NaN, replacement=0);
+ colMode = colMode(cX)
+ # initial mode imputation
+ X_c = cX+(Mask_c*colMode)
- # Taking out the categorical features for initial imputation by mode
- eX_c = removeEmpty(target = q, margin = "cols")
- col_c = ncol(eX_c);
- eX_c2 = removeEmpty(target = eX_c, margin = "rows", select = (rowSums(eX_c != 0)==col_c))
- colMod = matrix(0, 1, ncol(eX_c))
- # compute columnwise mode
- parfor(i in 1: col_c) {
- f = eX_c2[, i] # adding one in data for dealing with zero category
- cat_counts = table(f, 1, n, 1); # counts for each category
- mode = as.scalar(rowIndexMax(t(cat_counts)));
- colMod[1,i] = mode
- }
+ # reconstruct original matrix using sparse matrices p and q
+ p = table(seq(1, ncol(nX)), removeEmpty(target=seq(1, ncol(cMask)), margin="rows", select=t(cMask==0)), ncol(nX), ncol(X))
+ q = table(seq(1, ncol(cX)), removeEmpty(target=seq(1, ncol(cMask)), margin="rows", select=t(cMask)), ncol(cX), ncol(X))
+ X1 = (X_n %*% p) + (X_c %*% q)
+ Mask1 = is.na(X)
- # find the mask of missing values
- tmpMask_c = (eX_c==0) * colMod # fill missing values with mode
+ X = replace(target=X, pattern=NaN, replacement=0);
+ d = ncol(X1)
+ n = nrow(X1)
- # Generate a matrix of actual length
- p_c = table(seq(1, ncol(tmpMask_c)), removeEmpty(target=scat, margin ="rows", select=t(cMask)), ncol(tmpMask_c), ncol(cMask))
-
- Mask_c = tmpMask_c %*% p_c
- inverseMask_c = Mask_c == 0
- r = X2_n %*% p_n
- qr = q + r
- X2_c = qr + Mask_c
- Mask_c = Mask_c != 0
+ # compute index of categorical features
+ encodeIndex = removeEmpty(target=t(seq(1, ncol(X1))), margin="cols", select=cMask)
+ s=""
+ if(ncol(encodeIndex) == 1)
+ s = as.integer(as.scalar(encodeIndex[1, 1]))
+ else{
+ for(i in 1: ncol(encodeIndex)-1)
+ s = s+as.integer(as.scalar(encodeIndex[1, i]))+",";
+ s = s+as.integer(as.scalar(encodeIndex[1, ncol(encodeIndex)]))
+ }
- # one-hot encoding of categorical features
+ # specifications for one-hot encoding of categorical features
jspecDC = "{ids:true, dummycode:["+s+"]}";
- [dX, dM] = transformencode(target=as.frame(X2_c), spec=jspecDC);
-
- # recoding of metadata of OHE features to get the number of distinct elements
- [metaTransform, metaTransformMeta] = transformencode(target=dM, spec=jspecR);
- metaTransform = replace(target=metaTransform, pattern=NaN, replacement=0)
- # counting distinct elements in each categorical feature
- dcDistincts = colMaxs(metaTransform)
- dist = dcDistincts + (1-cMask)
-
- # creating a mask matrix of OHE features
- dXMask = matrix(0, 1, ncol(dX))
- index = 1
- for(k in 1:col) {
- nDistk = as.scalar(dcDistincts[1,k]);
- if(nDistk != 0) {
- dXMask[1,index:(index+nDistk-1)] = matrix(1,1,nDistk)
- index += nDistk;
- }
- else
- index += 1
- }
- #multiple imputations
- for(k in 1:iter)
+ for(k in 1:iter) # start iterative imputation
{
- Mask_Filled_n = Mask_n;
- Mask_Filled_c = Mask_c
- in_n = 1; in_c = 1; i=1; j=1; # variables for index selection
- while(i <= ncol(dX))
+ Mask_Filled = Mask1
+ inverseMask = Mask1 == 0
+ # OHE of categorical features
+ [dX, dM] = transformencode(target=as.frame(X1), spec=jspecDC);
Review comment:
Sorry for being naive, but can we not move this transformencode outside the loop, so that there are no frame operations?
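For readers following the diff above, the initial imputation step (mean for numeric columns, mode for categorical columns, selected by cMask) can be sketched in Python as below. This is only a sketch under stated assumptions, not the DML from the PR: the function name `initial_impute` is hypothetical, and the sketch averages observed values only, whereas the quoted diff calls colMeans after NaN cells have already been replaced with 0.

```python
import math
from collections import Counter

def initial_impute(X, c_mask):
    """Fill NaN cells with the column mean (numeric) or column mode (categorical).

    X      : list of rows of floats, NaN marks a missing cell
    c_mask : 0/1 per column (1 = categorical, 0 = numeric)
    Assumes every column has at least one observed value.
    Returns the imputed matrix and the boolean mask of originally missing cells.
    """
    n_cols = len(c_mask)
    fill = []
    for j in range(n_cols):
        observed = [row[j] for row in X if not math.isnan(row[j])]
        if c_mask[j] == 1:
            # mode: most frequent observed category code
            fill.append(Counter(observed).most_common(1)[0][0])
        else:
            # mean over observed values only (a simplification of this sketch)
            fill.append(sum(observed) / len(observed))
    # remember where the missing cells were, as Mask_n / Mask_c do in the diff
    mask = [[math.isnan(v) for v in row] for row in X]
    imputed = [
        [fill[j] if mask[i][j] else X[i][j] for j in range(n_cols)]
        for i in range(len(X))
    ]
    return imputed, mask

nan = math.nan
X = [[42.0, 1.0], [nan, 2.0], [51.0, nan], [57.0, 1.0]]
imputed, mask = initial_impute(X, [0, 1])
```

The stored mask is what lets later iterations overwrite only the originally missing cells while leaving observed values untouched.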
[GitHub] [systemml] j143 commented on pull request #972: [MINOR] MICE now accepts a recoded matrix as input
Posted by GitBox <gi...@apache.org>.
j143 commented on pull request #972:
URL: https://github.com/apache/systemml/pull/972#issuecomment-641692191
Can this file https://github.com/apache/systemml/blob/master/dev/docs/builtins-reference.md#mice-function also be updated?
Thank you.