Posted to commits@hivemall.apache.org by my...@apache.org on 2019/09/27 19:04:54 UTC
[incubator-hivemall-site] 01/04: Update entry about feature binning
This is an automated email from the ASF dual-hosted git repository.
myui pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hivemall-site.git
commit 26f41edc32f58b335f2798bbbca1237b41de893a
Author: Makoto Yui <my...@apache.org>
AuthorDate: Sat Jun 29 01:28:27 2019 +0900
Update entry about feature binning
---
userguide/ft_engineering/binning.html | 233 ++++++++++++++++++++++++----------
userguide/misc/funcs.html | 37 +++++-
userguide/misc/generic_funcs.html | 2 +-
3 files changed, 204 insertions(+), 68 deletions(-)
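Reviewer note (not part of the commit): the updated binning page below documents `feature_binning` only through example output. The following is a minimal Python sketch of the behavior those examples suggest, not Hivemall's implementation. The names `bin_id` and `feature_binning` here are hypothetical stand-ins, and the rules (a value equal to a threshold falls into the lower bin, NaN maps to the index one past the last bin, features without an entry in the map pass through unchanged) are assumptions inferred from the sample results in the diff.

```python
import math
from bisect import bisect_left


def bin_id(value, thresholds):
    """Return the 0-based bin index of value for thresholds [t0, ..., tn].

    Bin i is treated as the range (t_i, t_{i+1}], so a value equal to a
    threshold falls into the lower bin. NaN maps to n, one past the last
    bin; this is inferred from the 'age:NaN' -> 'age:3' example only.
    """
    if math.isnan(value):
        return len(thresholds) - 1
    # Search only the inner thresholds; the outer two are -inf and +inf.
    return bisect_left(thresholds, value, 1, len(thresholds) - 1) - 1


def feature_binning(features, quantiles_map):
    """Replace the value of each quantitative feature with its bin index.

    Categorical features ('name#Jacob') and quantitative features whose
    name has no entry in quantiles_map are returned unchanged.
    """
    binned = []
    for feature in features:
        name, sep, raw = feature.rpartition(":")
        if sep and name in quantiles_map:
            index = bin_id(float(raw), quantiles_map[name])
            binned.append(f"{name}:{index}")
        else:
            binned.append(feature)
    return binned


# Thresholds copied from the build_bins(age, 3) result shown in the diff.
quantiles_map = {
    "age": [float("-inf"), 18.333333333333332, 30.666666666666657, float("inf")]
}

print(feature_binning(["name#Jacob", "gender#Male", "age:20.0"], quantiles_map))
# ['name#Jacob', 'gender#Male', 'age:1']
```

The search is restricted to the inner thresholds so that `-Infinity` lands in bin 0 and `Infinity` in the last bin, matching the sample output in the page.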
diff --git a/userguide/ft_engineering/binning.html b/userguide/ft_engineering/binning.html
index 5d75620..1d4f235 100644
--- a/userguide/ft_engineering/binning.html
+++ b/userguide/ft_engineering/binning.html
@@ -2377,28 +2377,21 @@
specific language governing permissions and limitations
under the License.
-->
-<p>Feature binning is a method of dividing quantitative variables into categorical values.
-It groups quantitative values into a pre-defined number of bins.</p>
-<p><em>Note: This feature is supported from Hivemall v0.5-rc.1 or later.</em></p>
+<p>Feature binning is a method of dividing quantitative variables into categorical values. It groups quantitative values into a pre-defined number of bins.</p>
+<p>If the number of bins is set to 3, the bin ranges become something like <code>[-Inf, 1], (1, 10], (10, Inf]</code>.</p>
<!-- toc --><div id="toc" class="toc">
<ul>
<li><a href="#usage">Usage</a><ul>
-<li><a href="#a-feature-vector-trasformation-by-applying-feature-binning">A. Feature Vector trasformation by applying Feature Binning</a></li>
-<li><a href="#b-get-a-mapping-table-by-feature-binning">B. Get a mapping table by Feature Binning</a></li>
-</ul>
-</li>
-<li><a href="#function-signature">Function Signature</a><ul>
-<li><a href="#udaf-buildbinsweight-numofbins-autoshrink">[UDAF] <code>build_bins(weight, num_of_bins[, auto_shrink])</code></a><ul>
-<li><a href="#input">Input</a></li>
-<li><a href="#output">Output</a></li>
-</ul>
-</li>
-<li><a href="#udf-featurebinningfeatures-quantilesmapweight-quantiles">[UDF] <code>feature_binning(features, quantiles_map)/(weight, quantiles)</code></a><ul>
-<li><a href="#variation-a">Variation: A</a></li>
-<li><a href="#variation-b">Variation: B</a></li>
+<li><a href="#feature-vector-trasformation-by-applying-feature-binning">Feature Vector transformation by applying Feature Binning</a></li>
+<li><a href="#practical-example">Practical Example</a></li>
+<li><a href="#get-a-mapping-table-by-feature-binning">Get a mapping table by Feature Binning</a></li>
</ul>
</li>
+<li><a href="#function-signatures">Function Signatures</a><ul>
+<li><a href="#udaf-buildbinsweight-numofbins--autoshrinkfalse">UDAF <code>build_bins(weight, num_of_bins [, auto_shrink=false])</code></a></li>
+<li><a href="#udf-featurebinningfeatures-quantilesmap">UDF <code>feature_binning(features, quantiles_map)</code></a></li>
+<li><a href="#udf-featurebinningweight-quantiles">UDF <code>feature_binning(weight, quantiles)</code></a></li>
</ul>
</li>
</ul>
@@ -2407,35 +2400,96 @@ It groups quantitative values into a pre-defined number of bins.</p>
<h1 id="usage">Usage</h1>
<p>Prepare sample data (<em>users</em> table) first as follows:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> <span class="hljs-keyword">users</span> (
- <span class="hljs-keyword">name</span> <span class="hljs-keyword">string</span>, age <span class="hljs-built_in">int</span>, gender <span class="hljs-keyword">string</span>
+ <span class="hljs-keyword">rowid</span> <span class="hljs-built_in">int</span>, <span class="hljs-keyword">name</span> <span class="hljs-keyword">string</span>, age <span class="hljs-built_in">int</span>, gender <span class="hljs-keyword">string</span>
);
-
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> <span class="hljs-keyword">users</span> <span class="hljs-keyword">VALUES</span>
- (<span class="hljs-string">'Jacob'</span>, <span class="hljs-number">20</span>, <span class="hljs-string">'Male'</span>),
- (<span class="hljs-string">'Mason'</span>, <span class="hljs-number">22</span>, <span class="hljs-string">'Male'</span>),
- (<span class="hljs-string">'Sophia'</span>, <span class="hljs-number">35</span>, <span class="hljs-string">'Female'</span>),
- (<span class="hljs-string">'Ethan'</span>, <span class="hljs-number">55</span>, <span class="hljs-string">'Male'</span>),
- (<span class="hljs-string">'Emma'</span>, <span class="hljs-number">15</span>, <span class="hljs-string">'Female'</span>),
- (<span class="hljs-string">'Noah'</span>, <span class="hljs-number">46</span>, <span class="hljs-string">'Male'</span>),
- (<span class="hljs-string">'Isabella'</span>, <span class="hljs-number">20</span>, <span class="hljs-string">'Female'</span>);
+ (<span class="hljs-number">1</span>, <span class="hljs-string">'Jacob'</span>, <span class="hljs-number">20</span>, <span class="hljs-string">'Male'</span>),
+ (<span class="hljs-number">2</span>, <span class="hljs-string">'Mason'</span>, <span class="hljs-number">22</span>, <span class="hljs-string">'Male'</span>),
+ (<span class="hljs-number">3</span>, <span class="hljs-string">'Sophia'</span>, <span class="hljs-number">35</span>, <span class="hljs-string">'Female'</span>),
+ (<span class="hljs-number">4</span>, <span class="hljs-string">'Ethan'</span>, <span class="hljs-number">55</span>, <span class="hljs-string">'Male'</span>),
+ (<span class="hljs-number">5</span>, <span class="hljs-string">'Emma'</span>, <span class="hljs-number">15</span>, <span class="hljs-string">'Female'</span>),
+ (<span class="hljs-number">6</span>, <span class="hljs-string">'Noah'</span>, <span class="hljs-number">46</span>, <span class="hljs-string">'Male'</span>),
+ (<span class="hljs-number">7</span>, <span class="hljs-string">'Isabella'</span>, <span class="hljs-number">20</span>, <span class="hljs-string">'Female'</span>)
+;
+
+<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> <span class="hljs-keyword">input</span> <span class="hljs-keyword">as</span>
+<span class="hljs-keyword">SELECT</span>
+ <span class="hljs-keyword">rowid</span>,
+ array_concat(
+ categorical_features(
+ <span class="hljs-built_in">array</span>(<span class="hljs-string">'name'</span>, <span class="hljs-string">'gender'</span>),
+ <span class="hljs-keyword">name</span>, gender
+ ),
+ quantitative_features(
+ <span class="hljs-built_in">array</span>(<span class="hljs-string">'age'</span>),
+ age
+ )
+ ) <span class="hljs-keyword">AS</span> features
+<span class="hljs-keyword">FROM</span>
+ <span class="hljs-keyword">users</span>;
+
+<span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> <span class="hljs-keyword">input</span> <span class="hljs-keyword">limit</span> <span class="hljs-number">2</span>;
</code></pre>
-<h2 id="a-feature-vector-trasformation-by-applying-feature-binning">A. Feature Vector trasformation by applying Feature Binning</h2>
-<pre><code class="lang-sql">WITH t AS (
+<table>
+<thead>
+<tr>
+<th style="text-align:left">input.rowid</th>
+<th style="text-align:left">input.features</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td style="text-align:left">1</td>
+<td style="text-align:left">["name#Jacob","gender#Male","age:20.0"]</td>
+</tr>
+<tr>
+<td style="text-align:left">2</td>
+<td style="text-align:left">["name#Mason","gender#Male","age:22.0"]</td>
+</tr>
+</tbody>
+</table>
+<h2 id="feature-vector-trasformation-by-applying-feature-binning">Feature Vector transformation by applying Feature Binning</h2>
+<p>Now, convert <code>age</code> values into 3 bins.</p>
+<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span>
+ <span class="hljs-keyword">map</span>(<span class="hljs-string">'age'</span>, build_bins(age, <span class="hljs-number">3</span>)) <span class="hljs-keyword">AS</span> quantiles_map
+<span class="hljs-keyword">FROM</span>
+ <span class="hljs-keyword">users</span>
+</code></pre>
+<blockquote>
+<p>{"age":[-Infinity,18.333333333333332,30.666666666666657,Infinity]}</p>
+</blockquote>
+<p>In the above query result, <code>quantiles_map</code> holds 4 values for age. These are the thresholds that delimit the 3 bins.</p>
+<pre><code class="lang-sql">WITH bins as (
<span class="hljs-keyword">SELECT</span>
- array_concat(
- categorical_features(
- <span class="hljs-built_in">array</span>(<span class="hljs-string">'name'</span>, <span class="hljs-string">'gender'</span>),
- <span class="hljs-keyword">name</span>, gender
- ),
- quantitative_features(
- <span class="hljs-built_in">array</span>(<span class="hljs-string">'age'</span>),
- age
- )
- ) <span class="hljs-keyword">AS</span> features
+ <span class="hljs-keyword">map</span>(<span class="hljs-string">'age'</span>, build_bins(age, <span class="hljs-number">3</span>)) <span class="hljs-keyword">AS</span> quantiles_map
<span class="hljs-keyword">FROM</span>
<span class="hljs-keyword">users</span>
-),
-bins <span class="hljs-keyword">AS</span> (
+)
+<span class="hljs-keyword">select</span>
+ feature_binning(
+ <span class="hljs-built_in">array</span>(<span class="hljs-string">'age:-Infinity'</span>, <span class="hljs-string">'age:-1'</span>, <span class="hljs-string">'age:0'</span>, <span class="hljs-string">'age:1'</span>, <span class="hljs-string">'age:18.333333333333331'</span>, <span class="hljs-string">'age:18.333333333333332'</span>), quantiles_map
+ ),
+ feature_binning(
+ <span class="hljs-built_in">array</span>(<span class="hljs-string">'age:18.3333333333333333'</span>, <span class="hljs-string">'age:18.33333333333334'</span>, <span class="hljs-string">'age:19'</span>, <span class="hljs-string">'age:30'</span>, <span class="hljs-string">'age:30.666666666666656'</span>, <span class="hljs-string">'age:30.666666666666657'</span>), quantiles_map
+ ),
+ feature_binning(
+ <span class="hljs-built_in">array</span>(<span class="hljs-string">'age:666666666666658'</span>, <span class="hljs-string">'age:30.66666666666666'</span>, <span class="hljs-string">'age:31'</span>, <span class="hljs-string">'age:99'</span>, <span class="hljs-string">'age:Infinity'</span>), quantiles_map
+ ),
+ feature_binning(
+ <span class="hljs-built_in">array</span>(<span class="hljs-string">'age:NaN'</span>), quantiles_map
+ ),
+ feature_binning( <span class="hljs-comment">-- not in map</span>
+ <span class="hljs-built_in">array</span>(<span class="hljs-string">'weight:60.3'</span>), quantiles_map
+ )
+<span class="hljs-keyword">from</span>
+ bins
+</code></pre>
+<blockquote>
+<p>["age:0","age:0","age:0","age:0","age:0","age:0"] ["age:0","age:1","age:1","age:1","age:1","age:1"] ["age:2","age:2","age:2","age:2","age:2"] ["age:3"] ["weight:60.3"]</p>
+</blockquote>
+<p>The following query shows more practical usage:</p>
+<pre><code class="lang-sql">WITH bins AS (
<span class="hljs-keyword">SELECT</span>
<span class="hljs-keyword">map</span>(<span class="hljs-string">'age'</span>, build_bins(age, <span class="hljs-number">3</span>)) <span class="hljs-keyword">AS</span> quantiles_map
<span class="hljs-keyword">FROM</span>
@@ -2444,40 +2498,91 @@ bins <span class="hljs-keyword">AS</span> (
<span class="hljs-keyword">SELECT</span>
feature_binning(features, quantiles_map) <span class="hljs-keyword">AS</span> features
<span class="hljs-keyword">FROM</span>
- t <span class="hljs-keyword">CROSS</span> <span class="hljs-keyword">JOIN</span> bins;
+ <span class="hljs-keyword">input</span>
+ <span class="hljs-keyword">CROSS</span> <span class="hljs-keyword">JOIN</span> bins;
</code></pre>
-<p><em>Result</em></p>
<table>
<thead>
<tr>
-<th style="text-align:center">features: <code>array<features::string></code></th>
+<th style="text-align:left">features: <code>array<features::string></code></th>
</tr>
</thead>
<tbody>
<tr>
-<td style="text-align:center">["name#Jacob","gender#Male","age:1"]</td>
+<td style="text-align:left">["name#Jacob","gender#Male","age:1"]</td>
</tr>
<tr>
-<td style="text-align:center">["name#Mason","gender#Male","age:1"]</td>
+<td style="text-align:left">["name#Mason","gender#Male","age:1"]</td>
</tr>
<tr>
-<td style="text-align:center">["name#Sophia","gender#Female","age:2"]</td>
+<td style="text-align:left">["name#Sophia","gender#Female","age:2"]</td>
</tr>
<tr>
-<td style="text-align:center">["name#Ethan","gender#Male","age:2"]</td>
+<td style="text-align:left">["name#Ethan","gender#Male","age:2"]</td>
+</tr>
+<tr>
+<td style="text-align:left">...</td>
+</tr>
+</tbody>
+</table>
+<h2 id="practical-example">Practical Example</h2>
+<p>Here, we show a more practical usage of the <code>feature_binning</code> UDF that applies feature binning to given feature vectors.</p>
+<pre><code class="lang-sql">WITH extracted as (
+ <span class="hljs-keyword">select</span>
+ extract_feature(feature) <span class="hljs-keyword">as</span> <span class="hljs-keyword">index</span>,
+ extract_weight(feature) <span class="hljs-keyword">as</span> <span class="hljs-keyword">value</span>
+ <span class="hljs-keyword">from</span>
+ <span class="hljs-keyword">input</span> l
+ LATERAL <span class="hljs-keyword">VIEW</span> explode(features) r <span class="hljs-keyword">as</span> feature
+ <span class="hljs-keyword">where</span>
+ <span class="hljs-keyword">instr</span>(feature, <span class="hljs-string">':'</span>) > <span class="hljs-number">0</span> <span class="hljs-comment">-- filter out categorical features</span>
+),
+<span class="hljs-keyword">mapping</span> <span class="hljs-keyword">as</span> (
+ <span class="hljs-keyword">select</span>
+ <span class="hljs-keyword">index</span>,
+ build_bins(<span class="hljs-keyword">value</span>, <span class="hljs-number">5</span>, <span class="hljs-literal">true</span>) <span class="hljs-keyword">as</span> quantiles <span class="hljs-comment">-- 5 bins with auto bin shrinking</span>
+ <span class="hljs-keyword">from</span>
+ extracted
+ <span class="hljs-keyword">group</span> <span class="hljs-keyword">by</span>
+ <span class="hljs-keyword">index</span>
+),
+bins <span class="hljs-keyword">as</span> (
+ <span class="hljs-keyword">select</span>
+ to_map(<span class="hljs-keyword">index</span>, quantiles) <span class="hljs-keyword">as</span> quantiles
+ <span class="hljs-keyword">from</span>
+ <span class="hljs-keyword">mapping</span>
+)
+<span class="hljs-keyword">select</span>
+ l.features <span class="hljs-keyword">as</span> original,
+ feature_binning(l.features, r.quantiles) <span class="hljs-keyword">as</span> features
+<span class="hljs-keyword">from</span>
+ <span class="hljs-keyword">input</span> l
+ <span class="hljs-keyword">cross</span> <span class="hljs-keyword">join</span> bins r
+<span class="hljs-comment">-- limit 10;</span>
+</code></pre>
+<table>
+<thead>
+<tr>
+<th style="text-align:left">original</th>
+<th style="text-align:left">features</th>
</tr>
+</thead>
+<tbody>
<tr>
-<td style="text-align:center">["name#Emma","gender#Female","age:0"]</td>
+<td style="text-align:left">["name#Jacob","gender#Male","age:20.0"]</td>
+<td style="text-align:left">["name#Jacob","gender#Male","age:2"]</td>
</tr>
<tr>
-<td style="text-align:center">["name#Noah","gender#Male","age:2"]</td>
+<td style="text-align:left">["name#Isabella","gender#Female","age:20.0"]</td>
+<td style="text-align:left">["name#Isabella","gender#Female","age:2"]</td>
</tr>
<tr>
-<td style="text-align:center">["name#Isabella","gender#Female","age:1"]</td>
+<td style="text-align:left">...</td>
+<td style="text-align:left">...</td>
</tr>
</tbody>
</table>
-<h2 id="b-get-a-mapping-table-by-feature-binning">B. Get a mapping table by Feature Binning</h2>
+<h2 id="get-a-mapping-table-by-feature-binning">Get a mapping table by Feature Binning</h2>
<pre><code class="lang-sql">WITH bins AS (
<span class="hljs-keyword">SELECT</span> build_bins(age, <span class="hljs-number">3</span>) <span class="hljs-keyword">AS</span> quantiles
<span class="hljs-keyword">FROM</span> <span class="hljs-keyword">users</span>
@@ -2487,7 +2592,6 @@ bins <span class="hljs-keyword">AS</span> (
<span class="hljs-keyword">FROM</span>
<span class="hljs-keyword">users</span> <span class="hljs-keyword">CROSS</span> <span class="hljs-keyword">JOIN</span> bins;
</code></pre>
-<p><em>Result</em></p>
<table>
<thead>
<tr>
@@ -2526,9 +2630,9 @@ bins <span class="hljs-keyword">AS</span> (
</tr>
</tbody>
</table>
-<h1 id="function-signature">Function Signature</h1>
-<h2 id="udaf-buildbinsweight-numofbins-autoshrink">[UDAF] <code>build_bins(weight, num_of_bins[, auto_shrink])</code></h2>
-<h3 id="input">Input</h3>
+<h1 id="function-signatures">Function Signatures</h1>
+<h3 id="udaf-buildbinsweight-numofbins--autoshrinkfalse">UDAF <code>build_bins(weight, num_of_bins [, auto_shrink=false])</code></h3>
+<h4 id="input">Input</h4>
<table>
<thead>
<tr>
@@ -2540,12 +2644,12 @@ bins <span class="hljs-keyword">AS</span> (
<tbody>
<tr>
<td style="text-align:center">weight</td>
-<td style="text-align:center">2 <=</td>
+<td style="text-align:center">greater than or equal to 2</td>
<td style="text-align:center">behavior when separations are repeated: T=>skip, F=>exception</td>
</tr>
</tbody>
</table>
-<h3 id="output">Output</h3>
+<h4 id="output">Output</h4>
<table>
<thead>
<tr>
@@ -2554,14 +2658,13 @@ bins <span class="hljs-keyword">AS</span> (
</thead>
<tbody>
<tr>
-<td style="text-align:center">array of separation value</td>
+<td style="text-align:center">thresholds of bins based on quantiles</td>
</tr>
</tbody>
</table>
<div class="panel panel-primary"><div class="panel-heading"><h3 class="panel-title" id="note"><i class="fa fa-edit"></i> Note</h3></div><div class="panel-body"><p>Quantiles may be repeated when <code>num_of_bins</code> is too large or there are too few data points.
-If <code>auto_shrink</code> is true, skip duplicated quantiles. If not, throw an exception.</p></div></div>
-<h2 id="udf-featurebinningfeatures-quantilesmapweight-quantiles">[UDF] <code>feature_binning(features, quantiles_map)/(weight, quantiles)</code></h2>
-<h3 id="variation-a">Variation: A</h3>
+If <code>auto_shrink</code> is set to true, duplicated quantiles are skipped. Otherwise, an exception is thrown.</p></div></div>
+<h3 id="udf-featurebinningfeatures-quantilesmap">UDF <code>feature_binning(features, quantiles_map)</code></h3>
<h4 id="input">Input</h4>
<table>
<thead>
@@ -2572,8 +2675,8 @@ If <code>auto_shrink</code> is true, skip duplicated quantiles. If not, throw an
</thead>
<tbody>
<tr>
-<td style="text-align:center">serialized feature</td>
-<td style="text-align:center">entry:: key: col name, val: quantiles</td>
+<td style="text-align:center">feature vector</td>
+<td style="text-align:center">a map where key=column name and value=quantiles</td>
</tr>
</tbody>
</table>
@@ -2586,11 +2689,11 @@ If <code>auto_shrink</code> is true, skip duplicated quantiles. If not, throw an
</thead>
<tbody>
<tr>
-<td style="text-align:center">serialized and binned features</td>
+<td style="text-align:center">binned features</td>
</tr>
</tbody>
</table>
-<h3 id="variation-b">Variation: B</h3>
+<h3 id="udf-featurebinningweight-quantiles">UDF <code>feature_binning(weight, quantiles)</code></h3>
<h4 id="input">Input</h4>
<table>
<thead>
@@ -2674,7 +2777,7 @@ Apache Hivemall is an effort undergoing incubation at The Apache Software Founda
<script>
var gitbook = gitbook || [];
gitbook.push(function() {
- gitbook.page.hasChanged({"page":{"title":"Feature Binning","level":"3.4","depth":1,"next":{"title":"Feature Paring","level":"3.5","depth":1,"path":"ft_engineering/pairing.md","ref":"ft_engineering/pairing.md","articles":[{"title":"Polynomial features","level":"3.5.1","depth":2,"path":"ft_engineering/polynomial.md","ref":"ft_engineering/polynomial.md","articles":[]}]},"previous":{"title":"Feature Selection","level":"3.3","depth":1,"path":"ft_engineering/selection.md","ref":"ft [...]
+ gitbook.page.hasChanged({"page":{"title":"Feature Binning","level":"3.4","depth":1,"next":{"title":"Feature Paring","level":"3.5","depth":1,"path":"ft_engineering/pairing.md","ref":"ft_engineering/pairing.md","articles":[{"title":"Polynomial features","level":"3.5.1","depth":2,"path":"ft_engineering/polynomial.md","ref":"ft_engineering/polynomial.md","articles":[]}]},"previous":{"title":"Feature Selection","level":"3.3","depth":1,"path":"ft_engineering/selection.md","ref":"ft [...]
});
</script>
</div>
diff --git a/userguide/misc/funcs.html b/userguide/misc/funcs.html
index a77222d..74adf17 100644
--- a/userguide/misc/funcs.html
+++ b/userguide/misc/funcs.html
@@ -2628,7 +2628,40 @@ Reference: <a href="https://papers.nips.cc/paper/3848-adaptive-regularization-of
<ul>
<li><p><code>build_bins(number weight, const int num_of_bins[, const boolean auto_shrink = false])</code> - Return quantiles representing bins: array<double></p>
</li>
-<li><p><code>feature_binning(array<features::string> features, const map<string, array<number>> quantiles_map)</code> / <em>FUNC</em>(number weight, const array<number> quantiles) - Returns binned features as an array<features::string> / bin ID as int</p>
+<li><p><code>feature_binning(array<features::string> features, map<string, array<number>> quantiles_map)</code> - returns a binned feature vector as an array<features::string>; <em>FUNC</em>(number weight, array<number> quantiles) - returns bin ID as int</p>
+<pre><code class="lang-sql">WITH extracted as (
+ <span class="hljs-keyword">select</span>
+ extract_feature(feature) <span class="hljs-keyword">as</span> <span class="hljs-keyword">index</span>,
+ extract_weight(feature) <span class="hljs-keyword">as</span> <span class="hljs-keyword">value</span>
+ <span class="hljs-keyword">from</span>
+ <span class="hljs-keyword">input</span> l
+ LATERAL <span class="hljs-keyword">VIEW</span> explode(features) r <span class="hljs-keyword">as</span> feature
+),
+<span class="hljs-keyword">mapping</span> <span class="hljs-keyword">as</span> (
+ <span class="hljs-keyword">select</span>
+ <span class="hljs-keyword">index</span>,
+ build_bins(<span class="hljs-keyword">value</span>, <span class="hljs-number">5</span>, <span class="hljs-literal">true</span>) <span class="hljs-keyword">as</span> quantiles <span class="hljs-comment">-- 5 bins with auto bin shrinking</span>
+ <span class="hljs-keyword">from</span>
+ extracted
+ <span class="hljs-keyword">group</span> <span class="hljs-keyword">by</span>
+ <span class="hljs-keyword">index</span>
+),
+bins <span class="hljs-keyword">as</span> (
+ <span class="hljs-keyword">select</span>
+ to_map(<span class="hljs-keyword">index</span>, quantiles) <span class="hljs-keyword">as</span> quantiles
+ <span class="hljs-keyword">from</span>
+ <span class="hljs-keyword">mapping</span>
+)
+<span class="hljs-keyword">select</span>
+ l.features <span class="hljs-keyword">as</span> original,
+ feature_binning(l.features, r.quantiles) <span class="hljs-keyword">as</span> features
+<span class="hljs-keyword">from</span>
+ <span class="hljs-keyword">input</span> l
+ <span class="hljs-keyword">cross</span> <span class="hljs-keyword">join</span> bins r
+
+> [<span class="hljs-string">"name#Jacob"</span>,<span class="hljs-string">"gender#Male"</span>,<span class="hljs-string">"age:20.0"</span>] [<span class="hljs-string">"name#Jacob"</span>,<span class="hljs-string">"gender#Male"</span>,<span class="hljs-string">"age:2"</span>]
+> [<span class="hljs-string">"name#Isabella"</span>,<span class="hljs-string">"gender#Female"</span>,<span class="hljs-string">"age:20.0"</span>] [<span class="hljs-string">"name#Isabella"</span>,<span class="hljs-string">"gender#Female"</span>,<span class="hljs-string">"age:2"</span>]
+</code></pre>
</li>
</ul>
<h2 id="feature-format-conversion">Feature format conversion</h2>
@@ -3024,7 +3057,7 @@ Apache Hivemall is an effort undergoing incubation at The Apache Software Founda
<script>
var gitbook = gitbook || [];
gitbook.push(function() {
- gitbook.page.hasChanged({"page":{"title":"List of Functions","level":"1.3","depth":1,"next":{"title":"Tips for Effective Hivemall","level":"1.4","depth":1,"path":"tips/README.md","ref":"tips/README.md","articles":[{"title":"Explicit add_bias() for better prediction","level":"1.4.1","depth":2,"path":"tips/addbias.md","ref":"tips/addbias.md","articles":[]},{"title":"Use rand_amplify() to better prediction results","level":"1.4.2","depth":2,"path":"tips/rand_amplify.md","ref":"t [...]
+ gitbook.page.hasChanged({"page":{"title":"List of Functions","level":"1.3","depth":1,"next":{"title":"Tips for Effective Hivemall","level":"1.4","depth":1,"path":"tips/README.md","ref":"tips/README.md","articles":[{"title":"Explicit add_bias() for better prediction","level":"1.4.1","depth":2,"path":"tips/addbias.md","ref":"tips/addbias.md","articles":[]},{"title":"Use rand_amplify() to better prediction results","level":"1.4.2","depth":2,"path":"tips/rand_amplify.md","ref":"t [...]
});
</script>
</div>
diff --git a/userguide/misc/generic_funcs.html b/userguide/misc/generic_funcs.html
index a5fbe95..8246823 100644
--- a/userguide/misc/generic_funcs.html
+++ b/userguide/misc/generic_funcs.html
@@ -3183,7 +3183,7 @@ Apache Hivemall is an effort undergoing incubation at The Apache Software Founda
<script>
var gitbook = gitbook || [];
gitbook.push(function() {
- gitbook.page.hasChanged({"page":{"title":"List of Generic Hivemall Functions","level":"2.1","depth":1,"next":{"title":"Efficient Top-K Query Processing","level":"2.2","depth":1,"path":"misc/topk.md","ref":"misc/topk.md","articles":[]},"previous":{"title":"Map-side join causes ClassCastException on Tez","level":"1.6.5","depth":2,"path":"troubleshooting/mapjoin_classcastex.md","ref":"troubleshooting/mapjoin_classcastex.md","articles":[]},"dir":"ltr"},"config":{"plugins":["theme [...]
+ gitbook.page.hasChanged({"page":{"title":"List of Generic Hivemall Functions","level":"2.1","depth":1,"next":{"title":"Efficient Top-K Query Processing","level":"2.2","depth":1,"path":"misc/topk.md","ref":"misc/topk.md","articles":[]},"previous":{"title":"Map-side join causes ClassCastException on Tez","level":"1.6.5","depth":2,"path":"troubleshooting/mapjoin_classcastex.md","ref":"troubleshooting/mapjoin_classcastex.md","articles":[]},"dir":"ltr"},"config":{"plugins":["theme [...]
});
</script>
</div>