Posted to commits@systemds.apache.org by ss...@apache.org on 2020/09/02 14:34:42 UTC
[systemds] branch master updated: [MINOR][DOC] Updates in built-in docs
This is an automated email from the ASF dual-hosted git repository.
ssiddiqi pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/systemds.git
The following commit(s) were added to refs/heads/master by this push:
new f294099 [MINOR][DOC] Updates in built-in docs
f294099 is described below
commit f2940990d970ad3703d6ac3a45a7bb454fc9c5ba
Author: Shafaq Siddiqi <sh...@tugraz.at>
AuthorDate: Wed Sep 2 16:26:16 2020 +0200
[MINOR][DOC] Updates in built-in docs
This commit updates the doc file for the new builtin "smote".
The commit also introduces:
1. A sanity check in mice.dml
2. A minor fix to the calculation of iterations in smote.dml
3. A verbose variable in imputeByFD.dml
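
As a rough illustration of item 2 (not part of the commit itself), the effect of the iteration fix on the number of generated samples can be sketched in Python. The helper `synthetic_count` and its `fixed` flag are hypothetical names, and the sketch assumes each pass of the `while(iter > 0)` loop in smote.dml produces one synthetic row per minority-class sample, which matches the documented example of 40 samples and s = 200 yielding 80 rows.

```python
def synthetic_count(n_minority: int, s: int, fixed: bool = True) -> int:
    """Estimate the number of synthetic rows smote would return.

    Assumption: one synthetic row per minority sample per loop pass.
    fixed=True models iter = s/100 (after the commit),
    fixed=False models iter = (s/100) - 1 (before the commit).
    """
    if s % 100 != 0:
        raise ValueError("s must be an integral multiple of 100")
    iterations = s // 100 if fixed else s // 100 - 1
    return n_minority * max(iterations, 0)


print(synthetic_count(40, 200))               # after the fix: 80
print(synthetic_count(40, 200, fixed=False))  # before the fix: 40
```

Under this assumption, the old formula under-generated by one full pass over the minority class, so s = 100 would have produced no synthetic samples at all.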
---
docs/site/builtins-reference.md | 60 +++++++++++++++++++++----
scripts/builtin/imputeByFD.dml | 5 ++-
scripts/builtin/mice.dml | 6 ++-
scripts/builtin/smote.dml | 4 +-
src/test/scripts/functions/builtin/imputeFD.dml | 2 +-
src/test/scripts/functions/builtin/smote.dml | 2 +-
6 files changed, 65 insertions(+), 14 deletions(-)
diff --git a/docs/site/builtins-reference.md b/docs/site/builtins-reference.md
index 6d772f3..66b2299 100644
--- a/docs/site/builtins-reference.md
+++ b/docs/site/builtins-reference.md
@@ -50,6 +50,7 @@ limitations under the License.
* [`pnmf`-Function](#pnmf-function)
* [`scale`-Function](#scale-function)
* [`sigmoid`-Function](#sigmoid-function)
+ * [`smote`-Function](#smote-function)
* [`steplm`-Function](#steplm-function)
* [`slicefinder`-Function](#slicefinder-function)
* [`normalize`-Function](#normalize-function)
@@ -535,14 +536,14 @@ using robust functional dependencies.
### Usage
```r
-imputeByFD(F, sourceAttribute, targetAttribute, threshold)
+imputeByFD(X, sourceAttribute, targetAttribute, threshold)
```
### Arguments
| Name | Type | Default | Description |
| :-------- | :------ | -------- | :---------- |
-| F | String | -- | A data frame |
+| X | Matrix[Double] | -- | Matrix of feature vectors (recoded matrix for non-numeric values) |
| source | Integer | -- | Source attribute to use for imputation and error correction |
| target | Integer | -- | Attribute to be fixed |
| threshold | Double | -- | threshold value in interval [0, 1] for robust FDs |
@@ -551,8 +552,14 @@ imputeByFD(F, sourceAttribute, targetAttribute, threshold)
| Type | Description |
| :----- | :---------- |
-| String | Frame with possible imputations |
+| Matrix[Double] | Matrix with possible imputations |
+### Example
+
+```r
+X = matrix("1 1 1 2 4 5 5 3 3 NaN 4 5 4 1", rows=7, cols=2)
+imputeByFD(X = X, sourceAttribute = 1, targetAttribute = 2, threshold = 0.6, verbose = FALSE)
+```
## `KMeans`-Function
@@ -777,25 +784,24 @@ mice(F, cMask, iter, complete, verbose)
| Name | Type | Default | Description |
| :------- | :------------- | -------- | :---------- |
-| F | Frame[String] | required | Data Frame with one-dimensional row matrix with N columns where N>1. |
+| X | Matrix[Double] | required | Data Matrix (Recoded Matrix for categorical features), ncol(X) > 1|
| cMask | Matrix[Double] | required | 0/1 row vector for identifying numeric (0) and categorical features (1) with one-dimensional row matrix with column = ncol(F). |
| iter | Integer | `3` | Number of iteration for multiple imputations. |
-| complete | Integer | `3` | A complete dataset generated though a specific iteration. |
| verbose | Boolean | `FALSE` | Boolean value. |
### Returns
| Type | Description |
| :------------- | :---------- |
-| Frame[String] | imputed dataset. |
-| Frame[String] | A complete dataset generated though a specific iteration. |
+| Matrix[Double] | imputed dataset. |
+
### Example
```r
-F = as.frame(matrix("4 3 2 8 7 8 5", rows=1, cols=7))
+F = matrix("4 3 NaN 8 7 8 5 NaN 6", rows=3, cols=3)
cMask = round(rand(rows=1,cols=ncol(F),min=0,max=1))
-[dataset, singleSet] = mice(F, cMask, iter = 3, complete = 3, verbose = FALSE)
+dataset = mice(F, cMask, iter = 3, verbose = FALSE)
```
## `multiLogReg`-Function
@@ -936,7 +942,43 @@ sigmoid(X)
X = rand (rows = 20, cols = 10)
Y = sigmoid(X)
```
+## `smote`-Function
+
+The `smote`-function (Synthetic Minority Oversampling Technique) implements a classical technique for handling class imbalance.
+The built-in takes samples from the minority class and over-samples them by generating synthesized samples.
+The built-in accepts two parameters, s and k. The parameter s defines the number of synthesized samples to be generated,
+ i.e., it over-samples the minority class by s percent, where s is a multiple of 100. Given 40 samples of the minority class and
+ s = 200, smote will generate 80 synthesized samples to over-sample the class by 200 percent. The parameter k sets the number of
+ nearest neighbours found for each minority class sample; neighbours are then chosen randomly from these during synthesis.
+
+### Usage
+
+```r
+smote(X, s, k, verbose);
+```
+
+### Arguments
+
+| Name | Type | Default | Description |
+| :------ | :------------- | -------- | :---------- |
+| X | Matrix[Double] | required | Matrix of feature vectors of minority class samples |
+| s | Integer | `200` | Amount of SMOTE (percentage of oversampling), integral multiple of 100 |
+| k | Integer | `1` | Number of nearest neighbours |
+| verbose | Boolean | `TRUE` | If `TRUE` print messages are activated |
+
+### Returns
+
+| Type | Description |
+| :------------- | :---------- |
+| Matrix[Double] | Matrix of (s/100) * nrow(X) synthetic minority class samples |
+
+
+### Example
+```r
+X = rand (rows = 50, cols = 10)
+B = smote(X = X, s=200, k=3, verbose=TRUE);
+```
## `steplm`-Function
The `steplm`-function (stepwise linear regression) implements a classical forward feature selection method.
diff --git a/scripts/builtin/imputeByFD.dml b/scripts/builtin/imputeByFD.dml
index 01281d2..2791eb3 100644
--- a/scripts/builtin/imputeByFD.dml
+++ b/scripts/builtin/imputeByFD.dml
@@ -39,7 +39,7 @@
# X Double --- Matrix with possible imputations
-m_imputeByFD = function(Matrix[Double] X, Integer sourceAttribute, Integer targetAttribute, Double threshold)
+m_imputeByFD = function(Matrix[Double] X, Integer sourceAttribute, Integer targetAttribute, Double threshold, Boolean verbose = FALSE)
return(Matrix[Double] X)
{
# sanity checks
@@ -51,6 +51,9 @@ m_imputeByFD = function(Matrix[Double] X, Integer sourceAttribute, Integer targe
# impute missing values and fix errors
X[,targetAttribute] = imputeAndCorrect(X[,sourceAttribute], X[,targetAttribute], threshold)
+
+ if(verbose)
+ print("output \n"+toString(X))
}
imputeAndCorrect = function(Matrix[Double] X, Matrix[Double] Y, Double threshold)
diff --git a/scripts/builtin/mice.dml b/scripts/builtin/mice.dml
index f40ac81..96c41d4 100644
--- a/scripts/builtin/mice.dml
+++ b/scripts/builtin/mice.dml
@@ -25,7 +25,7 @@
# ---------------------------------------------------------------------------------------------
# NAME TYPE DEFAULT MEANING
# ---------------------------------------------------------------------------------------------
-# X String --- Data Matrix (Recoded Matrix for categorical features)
+# X Double --- Data Matrix (Recoded Matrix for categorical features)
# cMask Double --- A 0/1 row vector for identifying numeric (0) and categorical features (1)
# iter Integer 3 Number of iteration for multiple imputations
# ---------------------------------------------------------------------------------------------
@@ -43,6 +43,10 @@
m_mice= function(Matrix[Double] X, Matrix[Double] cMask, Integer iter = 3, Boolean verbose = FALSE)
return(Matrix[Double] output)
{
+ if(ncol(X) < 2)
+ stop("MICE can not be applied on single vectors.
+ expected number of columns > 1 found: "+ncol(X))
+
lastIndex = ncol(X);
sumMax = sum(cMask);
diff --git a/scripts/builtin/smote.dml b/scripts/builtin/smote.dml
index 14947ea..857120e 100644
--- a/scripts/builtin/smote.dml
+++ b/scripts/builtin/smote.dml
@@ -56,7 +56,7 @@ return (Matrix[Double] Y) {
}
# number of synthetic samples from each minority class sample
- iter = (s/100)-1
+ iter = (s/100)
# matrix to store synthetic samples
synthetic_samples = matrix(0, 0, ncol(X))
while(iter > 0)
@@ -79,6 +79,8 @@ return (Matrix[Double] Y) {
}
Y = synthetic_samples
+ if(verbose)
+ print(nrow(Y)+ " synthesized samples generated.")
}
diff --git a/src/test/scripts/functions/builtin/imputeFD.dml b/src/test/scripts/functions/builtin/imputeFD.dml
index 4782921..bcb61f1 100644
--- a/src/test/scripts/functions/builtin/imputeFD.dml
+++ b/src/test/scripts/functions/builtin/imputeFD.dml
@@ -31,7 +31,7 @@ for(i in 1: ncol(F)) {
jspecR = "{ids:true, recode:["+s+"]}";
[X, M] = transformencode(target=F, spec=jspecR);
# call the method
-Y = imputeByFD(X, $2, $3, $4);
+Y = imputeByFD(X, $2, $3, $4, FALSE);
# getting the actual data back
dF = transformdecode(target=Y, spec=jspecR, meta=M);
diff --git a/src/test/scripts/functions/builtin/smote.dml b/src/test/scripts/functions/builtin/smote.dml
index dc33f18..3891c98 100644
--- a/src/test/scripts/functions/builtin/smote.dml
+++ b/src/test/scripts/functions/builtin/smote.dml
@@ -21,7 +21,7 @@
A = read($X);
-B = smote(X = A, s=$S, k=$K);
+B = smote(X = A, s=$S, k=$K, verbose=TRUE);
# test if all point fall in same cluster (closed to each other)
# read some new data T != A