Posted to commits@hivemall.apache.org by my...@apache.org on 2019/04/12 06:38:59 UTC
[incubator-hivemall] branch master updated: [HIVEMALL-250][DOC] Add tutorial for binarize_label

This is an automated email from the ASF dual-hosted git repository.

myui pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hivemall.git


The following commit(s) were added to refs/heads/master by this push:
     new b187b10  [HIVEMALL-250][DOC] Add tutorial for binarize_label
b187b10 is described below

commit b187b10c0cbcfd4809cabc1cddcb68190109d839
Author: Makoto Yui <my...@apache.org>
AuthorDate: Fri Apr 12 15:38:53 2019 +0900

    [HIVEMALL-250][DOC] Add tutorial for binarize_label
    
    ## What changes were proposed in this pull request?
    
    Add tutorial for `binarize_label` UDTF
    
    ## What type of PR is it?
    
    Documentation
    
    ## What is the Jira issue?
    
    https://issues.apache.org/jira/browse/HIVEMALL-250
    
    ## How to use this feature?
    
    as described in tutorial
    
    Author: Makoto Yui <my...@apache.org>
    
    Closes #187 from myui/HIVEMALL-250.
---
 docs/gitbook/SUMMARY.md                 |  1 +
 docs/gitbook/ft_engineering/binarize.md | 52 +++++++++++++++++++++++++++++++++
 2 files changed, 53 insertions(+)

diff --git a/docs/gitbook/SUMMARY.md b/docs/gitbook/SUMMARY.md
index 7ead819..02fc97e 100644
--- a/docs/gitbook/SUMMARY.md
+++ b/docs/gitbook/SUMMARY.md
@@ -65,6 +65,7 @@
 * [Feature Transformation](ft_engineering/ft_trans.md)
     * [Feature vectorization](ft_engineering/vectorization.md)
     * [Quantify non-number features](ft_engineering/quantify.md)
+    * [Binarize label](ft_engineering/binarize.md)
 * [Term Vector Model](ft_engineering/term_vector.md)
     * [TF-IDF Term Weighting](ft_engineering/tfidf.md)
     * [Okapi BM25 Term Weighting](ft_engineering/bm25.md)
diff --git a/docs/gitbook/ft_engineering/binarize.md b/docs/gitbook/ft_engineering/binarize.md
new file mode 100644
index 0000000..f237eb3
--- /dev/null
+++ b/docs/gitbook/ft_engineering/binarize.md
@@ -0,0 +1,52 @@
+## Introduction
+
+Expanding aggregated label counts into the actual number of samples can improve accuracy in some cases. `binarize_label` explodes a record that holds the counts of positive and negative samples into that many individually labeled rows. For example,
+
+|positive|negative|features|
+|:----|:----|:----|
+|2|3|"[a:1, b:2]"|
+
+is converted into 
+
+|features|label|
+|:----|:----|
+|"[a:1, b:2]"|1|
+|"[a:1, b:2]"|1|
+|"[a:1, b:2]"|0|
+|"[a:1, b:2]"|0|
+|"[a:1, b:2]"|0|
+
+## Function signature
+
+`binarize_label(int/long positive, int/long negative, ANY arg1, ANY arg2, ..., ANY argN)`
+returns `(ANY arg1, ANY arg2, ..., ANY argN, int label)`, emitting `positive` rows with label 1 and `negative` rows with label 0.
+
+## Usage
+
+```sql
+WITH input as (
+  select 2 as positive, 3 as negative, array('a:1','b:2') as features
+  UNION ALL
+  select 2 as positive, 1 as negative, array('c:3','d:4') as features
+)
+SELECT
+  binarize_label(positive, negative, features)
+from 
+  input;
+```
+
+|features|label|
+|:----|:----|
+| ["a:1","b:2"]  | 1 |
+| ["a:1","b:2"]  | 1 |
+| ["a:1","b:2"]  | 0 |
+| ["a:1","b:2"]  | 0 |
+| ["a:1","b:2"]  | 0 |
+| ["c:3","d:4"]  | 1 |
+| ["c:3","d:4"]  | 1 |
+| ["c:3","d:4"]  | 0 |
+
+
+> #### Caution
+>
+> Rows are emitted grouped by label, so don't forget to shuffle the converted training instances into random order, e.g., with `CLUSTER BY rand()`.
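
The caution at the end of the tutorial can be sketched as a query. This is an illustration only and not part of the commit: it reuses the `input` rows from the Usage section, the CTE name `exploded` and the column aliases given in `AS (features, label)` are assumptions made for the example, and the row order differs from run to run.

```sql
-- Sketch only: expand the counts, then shuffle the resulting rows.
WITH input AS (
  SELECT 2 AS positive, 3 AS negative, array('a:1','b:2') AS features
  UNION ALL
  SELECT 2 AS positive, 1 AS negative, array('c:3','d:4') AS features
),
exploded AS (
  -- `exploded` is a hypothetical name; the aliases name the UDTF output columns.
  SELECT
    binarize_label(positive, negative, features) AS (features, label)
  FROM
    input
)
SELECT
  features, label
FROM
  exploded
-- Distribute and sort by a random key so positive and negative rows are interleaved.
CLUSTER BY rand();
```

Wrapping the UDTF call in its own CTE keeps the shuffle separate from the expansion, so the same pattern applies when the input comes from a table rather than inline rows.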