Posted to commits@systemds.apache.org by ss...@apache.org on 2020/09/02 14:34:42 UTC

[systemds] branch master updated: [MINOR][DOC] Updates in built-in docs

This is an automated email from the ASF dual-hosted git repository.

ssiddiqi pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/systemds.git


The following commit(s) were added to refs/heads/master by this push:
     new f294099  [MINOR][DOC] Updates in built-in docs
f294099 is described below

commit f2940990d970ad3703d6ac3a45a7bb454fc9c5ba
Author: Shafaq Siddiqi <sh...@tugraz.at>
AuthorDate: Wed Sep 2 16:26:16 2020 +0200

    [MINOR][DOC] Updates in built-in docs
    
    This commit updates the doc file for the new builtin "smote".
    The commit also introduces:
    1. A sanity check in mice.dml
    2. A minor fix in the calculation of iterations in smote.dml
    3. A verbose variable in imputeByFD.dml
---
 docs/site/builtins-reference.md                 | 60 +++++++++++++++++++++----
 scripts/builtin/imputeByFD.dml                  |  5 ++-
 scripts/builtin/mice.dml                        |  6 ++-
 scripts/builtin/smote.dml                       |  4 +-
 src/test/scripts/functions/builtin/imputeFD.dml |  2 +-
 src/test/scripts/functions/builtin/smote.dml    |  2 +-
 6 files changed, 65 insertions(+), 14 deletions(-)
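
The iteration fix in smote.dml can be sanity-checked with a little arithmetic.
The following is a minimal Python sketch, not code from the commit. It assumes
each pass of the while loop emits one synthetic sample per minority-class row;
the function name synthetic_count and its parameters are hypothetical.

```python
def synthetic_count(n_minority: int, s: int, fixed: bool = True) -> int:
    """Estimate how many synthetic rows smote returns.

    Assumption: every pass of the loop generates one synthetic sample
    for each of the n_minority input rows.
    """
    if s < 100 or s % 100 != 0:
        raise ValueError("s must be a positive integral multiple of 100")
    # After the commit: iter = s/100. Before the commit: iter = (s/100) - 1.
    passes = s // 100 if fixed else s // 100 - 1
    return passes * n_minority


if __name__ == "__main__":
    # The documented case: 40 minority samples, s = 200.
    print("after fix: ", synthetic_count(40, 200))
    print("before fix:", synthetic_count(40, 200, fixed=False))
```

Under that assumption, the corrected formula matches the documented example
(40 samples and s = 200 give 80 synthetic samples), whereas the old formula
would have produced only 40.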

diff --git a/docs/site/builtins-reference.md b/docs/site/builtins-reference.md
index 6d772f3..66b2299 100644
--- a/docs/site/builtins-reference.md
+++ b/docs/site/builtins-reference.md
@@ -50,6 +50,7 @@ limitations under the License.
     * [`pnmf`-Function](#pnmf-function)
     * [`scale`-Function](#scale-function)
     * [`sigmoid`-Function](#sigmoid-function)
+    * [`smote`-Function](#smote-function)
     * [`steplm`-Function](#steplm-function)
     * [`slicefinder`-Function](#slicefinder-function)
     * [`normalize`-Function](#normalize-function)
@@ -535,14 +536,14 @@ using robust functional dependencies.
 ### Usage
 
 ```r
-imputeByFD(F, sourceAttribute, targetAttribute, threshold)
+imputeByFD(X, sourceAttribute, targetAttribute, threshold)
 ```
 
 ### Arguments
 
 | Name      | Type    | Default  | Description |
 | :-------- | :------ | -------- | :---------- |
-| F         | String  | --       | A data frame |
+| X         | Matrix[Double]  | --       | Matrix of feature vectors (recoded matrix for non-numeric values) |
 | source    | Integer | --       | Source attribute to use for imputation and error correction |
 | target    | Integer | --       | Attribute to be fixed |
 | threshold | Double  | --       | threshold value in interval [0, 1] for robust FDs |
@@ -551,8 +552,14 @@ imputeByFD(F, sourceAttribute, targetAttribute, threshold)
 
 | Type   | Description |
 | :----- | :---------- |
-| String | Frame with possible imputations |
+| Matrix[Double] | Matrix with possible imputations |
 
+### Example
+
+```r
+X = matrix("1 1 1 2 4 5 5 3 3 NaN 4 5 4 1", rows=7, cols=2) 
+imputeByFD(X = X, sourceAttribute = 1, targetAttribute = 2, threshold = 0.6, verbose = FALSE)
+```
 
 ## `KMeans`-Function
 
@@ -777,25 +784,24 @@ mice(F, cMask, iter, complete, verbose)
 
 | Name     | Type           | Default  | Description |
 | :------- | :------------- | -------- | :---------- |
-| F        | Frame[String]  | required | Data Frame with one-dimensional row matrix with N columns where N>1. |
+| X        | Matrix[Double]  | required | Data Matrix (Recoded Matrix for categorical features), ncol(X) > 1|
 | cMask    | Matrix[Double] | required | 0/1 row vector for identifying numeric (0) and categorical features (1) with one-dimensional row matrix with column = ncol(F). |
 | iter     | Integer        | `3`      | Number of iteration for multiple imputations. |
-| complete | Integer        | `3`      | A complete dataset generated though a specific iteration. |
 | verbose  | Boolean        | `FALSE`  | Boolean value. |
 
 ### Returns
 
 | Type           | Description |
 | :------------- | :---------- |
-| Frame[String]  | imputed dataset. |
-| Frame[String]  | A complete dataset generated though a specific iteration. |
+| Matrix[Double]  | imputed dataset. |
+
 
 ### Example
 
 ```r
-F = as.frame(matrix("4 3 2 8 7 8 5", rows=1, cols=7))
+F = matrix("4 3 NaN 8 7 8 5 NaN 6", rows=3, cols=3)
 cMask = round(rand(rows=1,cols=ncol(F),min=0,max=1))
-[dataset, singleSet] = mice(F, cMask, iter = 3, complete = 3, verbose = FALSE)
+dataset = mice(F, cMask, iter = 3, verbose = FALSE)
 ```
 
 ## `multiLogReg`-Function
@@ -936,7 +942,43 @@ sigmoid(X)
 X = rand (rows = 20, cols = 10)
 Y = sigmoid(X)
 ```
+## `smote`-Function
+
+The `smote`-function (Synthetic Minority Oversampling Technique) implements a classical technique for handling class imbalance.
+The built-in takes the samples of the minority class and over-samples them by generating synthesized samples.
+The built-in accepts two parameters, `s` and `k`. The parameter `s` defines the number of synthesized samples to be generated,
+ i.e., it over-samples the minority class by `s` percent, where `s` is a multiple of 100. Given 40 minority-class samples and
+ `s = 200`, smote generates 80 synthesized samples, over-sampling the class by 200 percent. The parameter `k` sets the number of
+ nearest neighbours computed for each minority-class sample; the neighbours used in the synthesis process are chosen randomly from these.
+
+### Usage
+
+```r
+smote(X, s, k, verbose);
+```
+
+### Arguments
+
+| Name    | Type           | Default  | Description |
+| :------ | :------------- | -------- | :---------- |
+| X       | Matrix[Double] | required | Matrix of feature vectors of minority class samples |
+| s       | Integer        | `200`    | Amount of SMOTE (percentage of oversampling), integral multiple of 100 |
+| k       | Integer        | `1`      | Number of nearest neighbours |
+| verbose | Boolean        | `TRUE`   | If `TRUE` print messages are activated |
+
+### Returns
+
+| Type           | Description |
+| :------------- | :---------- |
+| Matrix[Double] | Matrix of (s/100) * nrow(X) synthetic minority class samples |
+
+
+### Example
 
+```r
+X = rand (rows = 50, cols = 10)
+B = smote(X = X, s=200, k=3, verbose=TRUE);
+```
 ## `steplm`-Function
 
 The `steplm`-function (stepwise linear regression) implements a classical forward feature selection method.
diff --git a/scripts/builtin/imputeByFD.dml b/scripts/builtin/imputeByFD.dml
index 01281d2..2791eb3 100644
--- a/scripts/builtin/imputeByFD.dml
+++ b/scripts/builtin/imputeByFD.dml
@@ -39,7 +39,7 @@
 # X               Double   ---        Matrix with possible imputations 
 
 
-m_imputeByFD = function(Matrix[Double] X, Integer sourceAttribute, Integer targetAttribute, Double threshold)
+m_imputeByFD = function(Matrix[Double] X, Integer sourceAttribute, Integer targetAttribute, Double threshold, Boolean verbose = FALSE)
   return(Matrix[Double] X)
 {
   # sanity checks
@@ -51,6 +51,9 @@ m_imputeByFD = function(Matrix[Double] X, Integer sourceAttribute, Integer targe
  
   # impute missing values and fix errors
   X[,targetAttribute] = imputeAndCorrect(X[,sourceAttribute], X[,targetAttribute], threshold) 
+
+  if(verbose)
+    print("output \n"+toString(X))
 }
 
 imputeAndCorrect = function(Matrix[Double] X, Matrix[Double] Y, Double threshold)
diff --git a/scripts/builtin/mice.dml b/scripts/builtin/mice.dml
index f40ac81..96c41d4 100644
--- a/scripts/builtin/mice.dml
+++ b/scripts/builtin/mice.dml
@@ -25,7 +25,7 @@
 # ---------------------------------------------------------------------------------------------
 # NAME            TYPE    DEFAULT     MEANING
 # ---------------------------------------------------------------------------------------------
-# X               String    ---        Data Matrix (Recoded Matrix for categorical features)
+# X               Double    ---        Data Matrix (Recoded Matrix for categorical features)
 # cMask           Double    ---        A 0/1 row vector for identifying numeric (0) and categorical features (1)
 # iter            Integer    3         Number of iteration for multiple imputations 
 # ---------------------------------------------------------------------------------------------
@@ -43,6 +43,10 @@
 m_mice= function(Matrix[Double] X, Matrix[Double] cMask, Integer iter = 3, Boolean verbose = FALSE)
   return(Matrix[Double] output)
 {
+  if(ncol(X) < 2)
+    stop("MICE can not be applied on single vectors.
+         expected number of columns > 1 found: "+ncol(X))
+    
   lastIndex = ncol(X);
   sumMax = sum(cMask);
   
diff --git a/scripts/builtin/smote.dml b/scripts/builtin/smote.dml
index 14947ea..857120e 100644
--- a/scripts/builtin/smote.dml
+++ b/scripts/builtin/smote.dml
@@ -56,7 +56,7 @@ return (Matrix[Double] Y) {
   }
   
   # number of synthetic samples from each minority class sample
-  iter = (s/100)-1
+  iter = (s/100)
   # matrix to store synthetic samples
   synthetic_samples = matrix(0, 0, ncol(X))
   while(iter > 0)
@@ -79,6 +79,8 @@ return (Matrix[Double] Y) {
   }
 
   Y = synthetic_samples
+  if(verbose)
+    print(nrow(Y)+ " synthesized samples generated.")
 }
   
 
diff --git a/src/test/scripts/functions/builtin/imputeFD.dml b/src/test/scripts/functions/builtin/imputeFD.dml
index 4782921..bcb61f1 100644
--- a/src/test/scripts/functions/builtin/imputeFD.dml
+++ b/src/test/scripts/functions/builtin/imputeFD.dml
@@ -31,7 +31,7 @@ for(i in 1: ncol(F)) {
 jspecR = "{ids:true, recode:["+s+"]}";
 [X, M] = transformencode(target=F, spec=jspecR);
 # call the method
-Y = imputeByFD(X, $2, $3, $4);
+Y = imputeByFD(X, $2, $3, $4, FALSE);
 
 # getting the actual data back
 dF = transformdecode(target=Y, spec=jspecR, meta=M);
diff --git a/src/test/scripts/functions/builtin/smote.dml b/src/test/scripts/functions/builtin/smote.dml
index dc33f18..3891c98 100644
--- a/src/test/scripts/functions/builtin/smote.dml
+++ b/src/test/scripts/functions/builtin/smote.dml
@@ -21,7 +21,7 @@
 
 
 A = read($X);
-B = smote(X = A, s=$S, k=$K);
+B = smote(X = A, s=$S, k=$K, verbose=TRUE);
 
 # test if all point fall in same cluster (closed to each other)
 # read some new data T != A