Posted to commits@hivemall.apache.org by my...@apache.org on 2019/04/12 06:38:59 UTC
[incubator-hivemall] branch master updated: [HIVEMALL-250][DOC] Add tutorial for binarize_label
This is an automated email from the ASF dual-hosted git repository.
myui pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hivemall.git
The following commit(s) were added to refs/heads/master by this push:
new b187b10 [HIVEMALL-250][DOC] Add tutorial for binarize_label
b187b10 is described below
commit b187b10c0cbcfd4809cabc1cddcb68190109d839
Author: Makoto Yui <my...@apache.org>
AuthorDate: Fri Apr 12 15:38:53 2019 +0900
[HIVEMALL-250][DOC] Add tutorial for binarize_label
## What changes were proposed in this pull request?
Add tutorial for `binarize_label` UDTF
## What type of PR is it?
Documentation
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-250
## How to use this feature?
As described in the tutorial.
Author: Makoto Yui <my...@apache.org>
Closes #187 from myui/HIVEMALL-250.
---
docs/gitbook/SUMMARY.md | 1 +
docs/gitbook/ft_engineering/binarize.md | 52 +++++++++++++++++++++++++++++++++
2 files changed, 53 insertions(+)
diff --git a/docs/gitbook/SUMMARY.md b/docs/gitbook/SUMMARY.md
index 7ead819..02fc97e 100644
--- a/docs/gitbook/SUMMARY.md
+++ b/docs/gitbook/SUMMARY.md
@@ -65,6 +65,7 @@
* [Feature Transformation](ft_engineering/ft_trans.md)
* [Feature vectorization](ft_engineering/vectorization.md)
* [Quantify non-number features](ft_engineering/quantify.md)
+ * [Binarize label](ft_engineering/binarize.md)
* [Term Vector Model](ft_engineering/term_vector.md)
* [TF-IDF Term Weighting](ft_engineering/tfidf.md)
* [Okapi BM25 Term Weighting](ft_engineering/bm25.md)
diff --git a/docs/gitbook/ft_engineering/binarize.md b/docs/gitbook/ft_engineering/binarize.md
new file mode 100644
index 0000000..f237eb3
--- /dev/null
+++ b/docs/gitbook/ft_engineering/binarize.md
@@ -0,0 +1,52 @@
+## Introduction
+
+Expanding aggregated label counts into the actual number of samples can improve accuracy in some cases. `binarize_label` explodes a record that holds the counts of positive and negative labeled samples into that many individual samples: one row with label 1 for each positive count and one row with label 0 for each negative count. For example,
+
+|positive|negative|features|
+|:----|:----|:----|
+|2|3|"[a:1, b:2]"|
+
+is converted into
+
+|features|label|
+|:----|:----|
+|"[a:1, b:2]"|1|
+|"[a:1, b:2]"|1|
+|"[a:1, b:2]"|0|
+|"[a:1, b:2]"|0|
+|"[a:1, b:2]"|0|
+
+## Function signature
+
+`binarize_label(int/long positive, int/long negative, ANY arg1, ANY arg2, ..., ANY argN)`
+returns (ANY arg1, ANY arg2, ..., ANY argN, int label) where label is 0 or 1.
+
+## Usage
+
+```sql
+WITH input AS (
+  SELECT 2 AS positive, 3 AS negative, array('a:1','b:2') AS features
+  UNION ALL
+  SELECT 2 AS positive, 1 AS negative, array('c:3','d:4') AS features
+)
+SELECT
+  binarize_label(positive, negative, features)
+FROM
+  input;
+```
+
+|features|label|
+|:----|:----|
+| ["a:1","b:2"] | 1 |
+| ["a:1","b:2"] | 1 |
+| ["a:1","b:2"] | 0 |
+| ["a:1","b:2"] | 0 |
+| ["a:1","b:2"] | 0 |
+| ["c:3","d:4"] | 1 |
+| ["c:3","d:4"] | 1 |
+| ["c:3","d:4"] | 0 |
+
+
+> #### Caution
+>
+> The converted rows come out grouped by source record and by label, so remember to shuffle the training instances into random order before training, e.g., with `CLUSTER BY rand()`.
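
The shuffling step mentioned in the caution can be sketched as follows. This is a minimal sketch rather than part of the commit: it reuses the `input` rows from the Usage example, the CTE name `exploded` is hypothetical, and the column aliases `(features, label)` are assumed from the output table shown above.

```sql
WITH input AS (
  SELECT 2 AS positive, 3 AS negative, array('a:1','b:2') AS features
  UNION ALL
  SELECT 2 AS positive, 1 AS negative, array('c:3','d:4') AS features
),
-- `exploded` is a hypothetical name; the aliases mirror the output table above
exploded AS (
  SELECT
    binarize_label(positive, negative, features) AS (features, label)
  FROM
    input
)
-- distribute and sort rows by a random key so that labels are interleaved
SELECT
  features, label
FROM
  exploded
CLUSTER BY rand();
```

The result should contain the same eight rows as the Usage output, but in a random order that changes from run to run.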