Posted to commits@systemds.apache.org by ja...@apache.org on 2021/05/06 20:56:17 UTC

[systemds] branch master updated: [SYSTEMDS-2877][DOC] Tomeklink builtin function (#1264)

This is an automated email from the ASF dual-hosted git repository.

janardhan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/systemds.git


The following commit(s) were added to refs/heads/master by this push:
     new 97a8f1b  [SYSTEMDS-2877][DOC] Tomeklink builtin function (#1264)
97a8f1b is described below

commit 97a8f1b4a46f9264c24791dac66a26c5f97c44f3
Author: j143 <j1...@protonmail.com>
AuthorDate: Fri May 7 02:26:06 2021 +0530

    [SYSTEMDS-2877][DOC] Tomeklink builtin function (#1264)
    
    Background:
    Because classifiers are designed to optimize accuracy, their performance
    suffers on imbalanced data, where the minority class is usually overlooked.
    
    "Any dataset with an unequal class distribution is technically imbalanced...
    However a significant ... disproportion among the number of examples of each
    class of the problem."
    
    Page 19, section 2.1
    source: https://www.springer.com/gp/book/9783319980737
    
    Similar function implemented elsewhere:
    imbalanced-learn (scikit-learn-contrib) -
    http://glemaitre.github.io/imbalanced-learn/generated/imblearn.under_sampling.TomekLinks.html
    R docs - https://www.rdocumentation.org/packages/UBL/versions/0.0.6/topics/TomekClassif
---
 docs/site/builtins-reference.md | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/docs/site/builtins-reference.md b/docs/site/builtins-reference.md
index 7e9a083..784b067 100644
--- a/docs/site/builtins-reference.md
+++ b/docs/site/builtins-reference.md
@@ -66,6 +66,7 @@ limitations under the License.
     * [`naiveBayes`-Function](#naiveBayes-function)
     * [`naiveBayesPredict`-Function](#naiveBayesPredict-function)
     * [`outlier`-Function](#outlier-function)
+    * [`tomekLink`-Function](#tomekLink-function)
     * [`toOneHot`-Function](#toOneHOt-function)
     * [`winsorize`-Function](#winsorize-function)
     * [`gmm`-Function](#gmm-function)
@@ -1530,6 +1531,43 @@ X = rand (rows = 50, cols = 10)
 outlier(X=X, opposite=1)
 ```
 
+## `tomekLink`-Function
+
+The `tomekLink`-function performs undersampling by removing Tomek links in imbalanced
+multiclass problems.
+
+Reference:
+I. Tomek, "Two Modifications of CNN," in IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-6, no. 11, pp. 769-772, Nov. 1976, doi: 10.1109/TSMC.1976.4309452.
+
+### Usage
+
+```r
+[X_under, y_under, drop_idx] = tomeklink(X, y)
+```
+
+### Arguments
+
+| Name       | Type           | Default  | Description |
+| :--------- | :------------- | -------- | :---------- |
+| X          | Matrix[Double] | required | Data Matrix (n,m) |
+| y          | Matrix[Double] | required | Label Matrix (n,1) |
+
+### Returns
+
+| Name | Type           | Description |
+| :--- | :------------- | :---------- |
+| X_under | Matrix[Double] | Data Matrix without Tomek links |
+| y_under | Matrix[Double] | Labels corresponding to undersampled data |
+| drop_idx | Matrix[Double] | Indices of the dropped rows/labels with respect to the input |
+
+### Example
+
+```r
+X = round(rand(rows = 53, cols = 6, min = -1, max = 1))
+y = round(rand(rows = nrow(X), cols = 1, min = 0, max = 1))
+[X_under, y_under, drop_idx] = tomeklink(X, y)
+```
+
 ## `toOneHot`-Function
 
 The `toOneHot`-function encodes unordered categorical vector to multiple binarized vectors.
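
Note on the Tomek-link criterion documented in this commit: two samples form a Tomek link when each is the other's nearest neighbor and their labels differ. The following is a minimal pure-Python sketch of that criterion, not the SystemDS implementation. The function names (`nearest_neighbor`, `tomek_link_indices`, `undersample`) and the policy of dropping only the majority-class member of each link are assumptions made for illustration; the DML builtin may choose differently, and it reports 1-based row indices whereas this sketch is 0-based.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def nearest_neighbor(points, i):
    """Index of the point closest to points[i], excluding i itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: dist(points[i], points[j]),
    )


def tomek_link_indices(points, labels):
    """Sorted indices of all points that take part in a Tomek link.

    Two points form a Tomek link when each is the other's nearest
    neighbor and their labels differ.
    """
    nn = [nearest_neighbor(points, i) for i in range(len(points))]
    linked = set()
    for i, j in enumerate(nn):
        if nn[j] == i and labels[i] != labels[j]:
            linked.update((i, j))
    return sorted(linked)


def undersample(points, labels, majority_label):
    """Drop the majority-class member of each Tomek link.

    Assumption: this drop policy is chosen for illustration only.
    Returns the kept points, the kept labels, and the dropped indices.
    """
    linked = tomek_link_indices(points, labels)
    drop = [i for i in linked if labels[i] == majority_label]
    keep = [i for i in range(len(points)) if i not in drop]
    return [points[i] for i in keep], [labels[i] for i in keep], drop


# Hypothetical toy data: points 2 and 3 are mutual nearest neighbors
# with different labels, so they form the only Tomek link.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.05, 1.0), (3.0, 3.0), (3.0, 3.2)]
labels = [0, 0, 0, 1, 0, 0]

X_under, y_under, drop_idx = undersample(points, labels, majority_label=0)
print(tomek_link_indices(points, labels))  # [2, 3]
print(drop_idx)                            # [2]
print(y_under)                             # [0, 0, 1, 0, 0]
```

The brute-force neighbor search is quadratic in the number of rows, which is fine for a toy example; a matrix-based implementation would typically compute all pairwise distances at once.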