Posted to commits@hivemall.apache.org by my...@apache.org on 2017/05/08 08:39:59 UTC
incubator-hivemall git commit: Close #77: [HIVEMALL-98] Feature binning documents
Repository: incubator-hivemall
Updated Branches:
refs/heads/master a31d0aab3 -> 211c28036
Close #77: [HIVEMALL-98] Feature binning documents
Project: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/commit/211c2803
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/tree/211c2803
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/diff/211c2803
Branch: refs/heads/master
Commit: 211c28036e4a7e7549b3e21fae723f207d85aa09
Parents: a31d0aa
Author: Ryuichi Ito <ma...@sapphire.in.net>
Authored: Mon May 8 17:39:44 2017 +0900
Committer: myui <yu...@gmail.com>
Committed: Mon May 8 17:39:44 2017 +0900
----------------------------------------------------------------------
docs/gitbook/SUMMARY.md | 9 +-
docs/gitbook/ft_engineering/binning.md | 162 +++++++++++++++++++
.../gitbook/ft_engineering/feature_selection.md | 155 ------------------
docs/gitbook/ft_engineering/scaling.md | 4 +-
docs/gitbook/ft_engineering/selection.md | 155 ++++++++++++++++++
docs/gitbook/ft_engineering/vectorization.md | 61 +++++++
docs/gitbook/ft_engineering/vectorizer.md | 61 -------
7 files changed, 385 insertions(+), 222 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/211c2803/docs/gitbook/SUMMARY.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/SUMMARY.md b/docs/gitbook/SUMMARY.md
index 3d035d7..809a548 100644
--- a/docs/gitbook/SUMMARY.md
+++ b/docs/gitbook/SUMMARY.md
@@ -55,14 +55,13 @@
* [Feature Scaling](ft_engineering/scaling.md)
* [Feature Hashing](ft_engineering/hashing.md)
-* [TF-IDF calculation](ft_engineering/tfidf.md)
-
+* [Feature Selection](ft_engineering/selection.md)
+* [Feature Binning](ft_engineering/binning.md)
+* [TF-IDF Calculation](ft_engineering/tfidf.md)
* [FEATURE TRANSFORMATION](ft_engineering/ft_trans.md)
- * [Vectorize Features](ft_engineering/vectorizer.md)
+ * [Feature Vectorization](ft_engineering/vectorization.md)
* [Quantify non-number features](ft_engineering/quantify.md)
-* [Feature selection](ft_engineering/feature_selection.md)
-
## Part IV - Evaluation
* [Statistical evaluation of a prediction model](eval/stat_eval.md)
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/211c2803/docs/gitbook/ft_engineering/binning.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/ft_engineering/binning.md b/docs/gitbook/ft_engineering/binning.md
new file mode 100644
index 0000000..cd1ecbb
--- /dev/null
+++ b/docs/gitbook/ft_engineering/binning.md
@@ -0,0 +1,162 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+-->
+
+Feature binning is a technique for converting quantitative variables into categorical ones by grouping their values into a pre-defined number of bins.
+
+*Note: This feature is supported in Hivemall v0.5-rc.1 or later.*
+
+<!-- toc -->
+
+# Usage
+
+Prepare sample data (*users* table) first as follows:
+
+``` sql
+CREATE TABLE users (
+ name string, age int, gender string
+);
+
+INSERT INTO users VALUES
+ ('Jacob', 20, 'Male'),
+ ('Mason', 22, 'Male'),
+ ('Sophia', 35, 'Female'),
+ ('Ethan', 55, 'Male'),
+ ('Emma', 15, 'Female'),
+ ('Noah', 46, 'Male'),
+ ('Isabella', 20, 'Female');
+```
+
+## A. Feature Vector Transformation by Applying Feature Binning
+
+``` sql
+WITH t AS (
+ SELECT
+ array_concat(
+ categorical_features(
+ array('name', 'gender'),
+ name, gender
+ ),
+ quantitative_features(
+ array('age'),
+ age
+ )
+ ) AS features
+ FROM
+ users
+),
+bins AS (
+ SELECT
+ map('age', build_bins(age, 3)) AS quantiles_map
+ FROM
+ users
+)
+SELECT
+ feature_binning(features, quantiles_map) AS features
+FROM
+ t CROSS JOIN bins;
+```
+
+*Result*
+
+| features: `array<features::string>` |
+| :-: |
+| ["name#Jacob","gender#Male","age:1"] |
+| ["name#Mason","gender#Male","age:1"] |
+| ["name#Sophia","gender#Female","age:2"] |
+| ["name#Ethan","gender#Male","age:2"] |
+| ["name#Emma","gender#Female","age:0"] |
+| ["name#Noah","gender#Male","age:2"] |
+| ["name#Isabella","gender#Female","age:1"] |
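
The bin assignments above can be reproduced outside Hive. The following is a rough Python sketch of the semantics (assuming, not from the Hivemall source, that `build_bins` computes linearly interpolated quantiles and that a value's bin ID is the number of separators less than or equal to it):

```python
from bisect import bisect_right
from statistics import quantiles

ages = [20, 22, 35, 55, 15, 46, 20]

# build_bins(age, 3): two separation values splitting the data into 3 bins
# ("inclusive" = linear interpolation over the observed min/max)
separators = quantiles(ages, n=3, method="inclusive")  # [20.0, 35.0]

# feature_binning(age, separators): bin ID = number of separators <= age
bins = [bisect_right(separators, a) for a in ages]
print(bins)  # [1, 1, 2, 2, 0, 2, 1] -- matches the table above
```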
+
+
+## B. Getting a Mapping Table by Feature Binning
+
+```sql
+WITH bins AS (
+ SELECT build_bins(age, 3) AS quantiles
+ FROM users
+)
+SELECT
+ age, feature_binning(age, quantiles) AS bin
+FROM
+ users CROSS JOIN bins;
+```
+
+*Result*
+
+| age: `int` | bin: `int` |
+|:-:|:-:|
+| 20 | 1 |
+| 22 | 1 |
+| 35 | 2 |
+| 55 | 2 |
+| 15 | 0 |
+| 46 | 2 |
+| 20 | 1 |
+
+# Function Signature
+
+## [UDAF] `build_bins(weight, num_of_bins[, auto_shrink])`
+
+### Input
+
+| weight: `int\|bigint\|float\|double` | num\_of\_bins: `int` | [auto\_shrink: `boolean` = false] |
+| :-: | :-: | :-: |
+| weight to bin | greater than or equal to 2 | if true, skip duplicated separators; if false, throw an exception |
+
+### Output
+
+| quantiles: `array<double>` |
+| :-: |
+| array of separation values (quantiles) |
+
+> #### Note
+> Quantiles may be duplicated when `num_of_bins` is too large or the data points are too few.
+> If `auto_shrink` is true, duplicated quantiles are skipped; otherwise, an exception is thrown.
+
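
The `auto_shrink` behavior can be modeled as follows (a hypothetical Python sketch, not the Hivemall source; `build_bins_sketch` is an illustrative name):

```python
from statistics import quantiles

def build_bins_sketch(data, num_of_bins, auto_shrink=False):
    """Return separation values; handle duplicated quantiles per auto_shrink."""
    seps = quantiles(data, n=num_of_bins, method="inclusive")
    deduped = sorted(set(seps))
    if len(deduped) < len(seps):           # duplicated quantiles found
        if not auto_shrink:
            raise ValueError("duplicated quantiles; set auto_shrink=true")
        return deduped                     # skip the duplicates
    return seps

skewed = [1, 1, 1, 1, 2]                   # too few distinct values for 4 bins
print(build_bins_sketch(skewed, 4, auto_shrink=True))  # [1.0]
# build_bins_sketch(skewed, 4) raises ValueError
```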
+## [UDF] `feature_binning(features, quantiles_map)/(weight, quantiles)`
+
+### Variation: A
+
+#### Input
+
+| features: `array<features::string>` | quantiles\_map: `map<string, array<double>>` |
+| :-: | :-: |
+| serialized features | each entry maps a column name (key) to its quantiles (value) |
+
+#### Output
+
+| features: `array<feature::string>` |
+| :-: |
+| serialized and binned features |
+
+### Variation: B
+
+#### Input
+
+| weight: `int\|bigint\|float\|double` | quantiles: `array<double>` |
+| :-: | :-: |
+| weight to bin | array of separation values |
+
+#### Output
+
+| bin: `int` |
+| :-: |
+| categorical value (bin ID) |
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/211c2803/docs/gitbook/ft_engineering/feature_selection.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/ft_engineering/feature_selection.md b/docs/gitbook/ft_engineering/feature_selection.md
deleted file mode 100644
index b19ba56..0000000
--- a/docs/gitbook/ft_engineering/feature_selection.md
+++ /dev/null
@@ -1,155 +0,0 @@
-<!--
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements. See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership. The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied. See the License for the
- specific language governing permissions and limitations
- under the License.
--->
-
-[Feature Selection](https://en.wikipedia.org/wiki/Feature_selection) is the process of selecting a subset of relevant features for use in model construction.
-
-It is a useful technique to 1) improve prediction results by omitting redundant features, 2) to shorten training time, and 3) to know important features for prediction.
-
-*Note: This feature is supported from Hivemall v0.5-rc.1 or later.*
-
-<!-- toc -->
-
-# Supported Feature Selection algorithms
-
-* Chi-square (Chi2)
- * In statistics, the $$\chi^2$$ test is applied to test the independence of two even events. Chi-square statistics between every feature variable and the target variable can be applied to Feature Selection. Refer [this article](http://nlp.stanford.edu/IR-book/html/htmledition/feature-selectionchi2-feature-selection-1.html) for Mathematical details.
-* Signal Noise Ratio (SNR)
- * The Signal Noise Ratio (SNR) is a univariate feature ranking metric, which can be used as a feature selection criterion for binary classification problems. SNR is defined as $$|\mu_{1} - \mu_{2}| / (\sigma_{1} + \sigma_{2})$$, where $$\mu_{k}$$ is the mean value of the variable in classes $$k$$, and $$\sigma_{k}$$ is the standard deviations of the variable in classes $$k$$. Clearly, features with larger SNR are useful for classification.
-
-# Usage
-
-## Feature Selection based on Chi-square test
-
-``` sql
-CREATE TABLE input (
- X array<double>, -- features
- Y array<int> -- binarized label
-);
-
-set hivevar:k=2;
-
-WITH stats AS (
- SELECT
- transpose_and_dot(Y, X) AS observed, -- array<array<double>>, shape = (n_classes, n_features)
- array_sum(X) AS feature_count, -- n_features col vector, shape = (1, array<double>)
- array_avg(Y) AS class_prob -- n_class col vector, shape = (1, array<double>)
- FROM
- input
-),
-test AS (
- SELECT
- transpose_and_dot(class_prob, feature_count) AS expected -- array<array<double>>, shape = (n_class, n_features)
- FROM
- stats
-),
-chi2 AS (
- SELECT
- chi2(r.observed, l.expected) AS v -- struct<array<double>, array<double>>, each shape = (1, n_features)
- FROM
- test l
- CROSS JOIN stats r
-)
-SELECT
- select_k_best(l.X, r.v.chi2, ${k}) as features -- top-k feature selection based on chi2 score
-FROM
- input l
- CROSS JOIN chi2 r;
-```
-
-## Feature Selection based on Signal Noise Ratio (SNR)
-
-``` sql
-CREATE TABLE input (
- X array<double>, -- features
- Y array<int> -- binarized label
-);
-
-set hivevar:k=2;
-
-WITH snr AS (
- SELECT snr(X, Y) AS snr -- aggregated SNR as array<double>, shape = (1, #features)
- FROM input
-)
-SELECT
- select_k_best(X, snr, ${k}) as features
-FROM
- input
- CROSS JOIN snr;
-```
-
-# Function signatures
-
-### [UDAF] `transpose_and_dot(X::array<number>, Y::array<number>)::array<array<double>>`
-
-##### Input
-
-| `array<number>` X | `array<number>` Y |
-| :-: | :-: |
-| a row of matrix | a row of matrix |
-
-##### Output
-
-| `array<array<double>>` dot product |
-| :-: |
-| `dot(X.T, Y)` of shape = (X.#cols, Y.#cols) |
-
-### [UDF] `select_k_best(X::array<number>, importance_list::array<number>, k::int)::array<double>`
-
-##### Input
-
-| `array<number>` X | `array<number>` importance_list | `int` k |
-| :-: | :-: | :-: |
-| feature vector | importance of each feature | the number of features to be selected |
-
-##### Output
-
-| `array<array<double>>` k-best features |
-| :-: |
-| top-k elements from feature vector `X` based on importance list |
-
-### [UDF] `chi2(observed::array<array<number>>, expected::array<array<number>>)::struct<array<double>, array<double>>`
-
-##### Input
-
-| `array<number>` observed | `array<number>` expected |
-| :-: | :-: |
-| observed features | expected features `dot(class_prob.T, feature_count)` |
-
-Both of `observed` and `expected` have a shape `(#classes, #features)`
-
-##### Output
-
-| `struct<array<double>, array<double>>` importance_list |
-| :-: |
-| chi2-value and p-value for each feature |
-
-### [UDAF] `snr(X::array<number>, Y::array<int>)::array<double>`
-
-##### Input
-
-| `array<number>` X | `array<int>` Y |
-| :-: | :-: |
-| feature vector | one hot label |
-
-##### Output
-
-| `array<double>` importance_list |
-| :-: |
-| Signal Noise Ratio for each feature |
-
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/211c2803/docs/gitbook/ft_engineering/scaling.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/ft_engineering/scaling.md b/docs/gitbook/ft_engineering/scaling.md
index 26d82bd..7f388d6 100644
--- a/docs/gitbook/ft_engineering/scaling.md
+++ b/docs/gitbook/ft_engineering/scaling.md
@@ -16,7 +16,9 @@
specific language governing permissions and limitations
under the License.
-->
-
+
+<!-- toc -->
+
# Min-Max Normalization
http://en.wikipedia.org/wiki/Feature_scaling#Rescaling
```sql
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/211c2803/docs/gitbook/ft_engineering/selection.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/ft_engineering/selection.md b/docs/gitbook/ft_engineering/selection.md
new file mode 100644
index 0000000..b19ba56
--- /dev/null
+++ b/docs/gitbook/ft_engineering/selection.md
@@ -0,0 +1,155 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+-->
+
+[Feature Selection](https://en.wikipedia.org/wiki/Feature_selection) is the process of selecting a subset of relevant features for use in model construction.
+
+It is a useful technique for 1) improving prediction results by omitting redundant features, 2) shortening training time, and 3) identifying which features are important for prediction.
+
+*Note: This feature is supported in Hivemall v0.5-rc.1 or later.*
+
+<!-- toc -->
+
+# Supported Feature Selection algorithms
+
+* Chi-square (Chi2)
+ * In statistics, the $$\chi^2$$ test is applied to test the independence of two events. Chi-square statistics between each feature variable and the target variable can be used for Feature Selection. Refer to [this article](http://nlp.stanford.edu/IR-book/html/htmledition/feature-selectionchi2-feature-selection-1.html) for mathematical details.
+* Signal Noise Ratio (SNR)
+ * The Signal Noise Ratio (SNR) is a univariate feature ranking metric, which can be used as a feature selection criterion for binary classification problems. SNR is defined as $$|\mu_{1} - \mu_{2}| / (\sigma_{1} + \sigma_{2})$$, where $$\mu_{k}$$ is the mean value of the variable in class $$k$$ and $$\sigma_{k}$$ is its standard deviation in class $$k$$. Features with a larger SNR are more useful for classification.
+
+# Usage
+
+## Feature Selection based on Chi-square test
+
+``` sql
+CREATE TABLE input (
+ X array<double>, -- features
+ Y array<int> -- binarized label
+);
+
+set hivevar:k=2;
+
+WITH stats AS (
+ SELECT
+ transpose_and_dot(Y, X) AS observed, -- array<array<double>>, shape = (n_classes, n_features)
+ array_sum(X) AS feature_count, -- n_features col vector, shape = (1, array<double>)
+ array_avg(Y) AS class_prob -- n_class col vector, shape = (1, array<double>)
+ FROM
+ input
+),
+test AS (
+ SELECT
+ transpose_and_dot(class_prob, feature_count) AS expected -- array<array<double>>, shape = (n_class, n_features)
+ FROM
+ stats
+),
+chi2 AS (
+ SELECT
+ chi2(r.observed, l.expected) AS v -- struct<array<double>, array<double>>, each shape = (1, n_features)
+ FROM
+ test l
+ CROSS JOIN stats r
+)
+SELECT
+ select_k_best(l.X, r.v.chi2, ${k}) as features -- top-k feature selection based on chi2 score
+FROM
+ input l
+ CROSS JOIN chi2 r;
+```
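
The query above mirrors the following computation (a pure-Python sketch of the same pipeline; it assumes, rather than quotes from the Hivemall source, that `chi2` computes the one-way chi-square statistic per feature and that `select_k_best` keeps the columns with the largest scores in column order):

```python
def transpose_and_dot(Y, X):
    # dot(Y.T, X): shape (n_classes, n_features)
    return [[sum(y[c] * x[f] for y, x in zip(Y, X))
             for f in range(len(X[0]))]
            for c in range(len(Y[0]))]

X = [[6, 0, 1], [4, 0, 2], [0, 5, 1], [0, 4, 3]]   # features
Y = [[1, 0], [1, 0], [0, 1], [0, 1]]               # one-hot labels

observed = transpose_and_dot(Y, X)                       # (n_classes, n_features)
feature_count = [sum(col) for col in zip(*X)]            # per-feature sums
class_prob = [sum(col) / len(Y) for col in zip(*Y)]      # per-class means
expected = [[p * fc for fc in feature_count] for p in class_prob]

# chi2 statistic per feature: sum over classes of (observed - expected)^2 / expected
chi2 = [sum((o[f] - e[f]) ** 2 / e[f] for o, e in zip(observed, expected))
        for f in range(len(feature_count))]              # approx [10.0, 9.0, 0.143]

# select_k_best: keep the k columns with the highest scores
k = 2
top = sorted(sorted(range(len(chi2)), key=lambda f: -chi2[f])[:k])
print([[row[f] for f in top] for row in X])  # [[6, 0], [4, 0], [0, 5], [0, 4]]
```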
+
+## Feature Selection based on Signal Noise Ratio (SNR)
+
+``` sql
+CREATE TABLE input (
+ X array<double>, -- features
+ Y array<int> -- binarized label
+);
+
+set hivevar:k=2;
+
+WITH snr AS (
+ SELECT snr(X, Y) AS snr -- aggregated SNR as array<double>, shape = (1, #features)
+ FROM input
+)
+SELECT
+ select_k_best(X, snr, ${k}) as features
+FROM
+ input
+ CROSS JOIN snr;
+```
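
The SNR score itself is straightforward to sketch in Python (a minimal model of the definition $$|\mu_{1} - \mu_{2}| / (\sigma_{1} + \sigma_{2})$$ for binary one-hot labels; population standard deviation is an assumption, and a zero-variance feature would need extra handling):

```python
from statistics import mean, pstdev

X = [[0.0, 5.0], [2.0, 7.0],     # class 0 rows
     [10.0, 5.0], [12.0, 7.0]]   # class 1 rows
Y = [[1, 0], [1, 0], [0, 1], [0, 1]]  # one-hot binary labels

def snr(X, Y):
    c0 = [x for x, y in zip(X, Y) if y[0] == 1]
    c1 = [x for x, y in zip(X, Y) if y[1] == 1]
    scores = []
    for f in range(len(X[0])):
        mu0, mu1 = mean(r[f] for r in c0), mean(r[f] for r in c1)
        sd0, sd1 = pstdev([r[f] for r in c0]), pstdev([r[f] for r in c1])
        scores.append(abs(mu0 - mu1) / (sd0 + sd1))
    return scores

print(snr(X, Y))  # [5.0, 0.0] -- feature 0 separates the classes; feature 1 does not
```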
+
+# Function signatures
+
+### [UDAF] `transpose_and_dot(X::array<number>, Y::array<number>)::array<array<double>>`
+
+##### Input
+
+| `array<number>` X | `array<number>` Y |
+| :-: | :-: |
+| a row of matrix | a row of matrix |
+
+##### Output
+
+| `array<array<double>>` dot product |
+| :-: |
+| `dot(X.T, Y)` of shape = (X.#cols, Y.#cols) |
+
+### [UDF] `select_k_best(X::array<number>, importance_list::array<number>, k::int)::array<double>`
+
+##### Input
+
+| `array<number>` X | `array<number>` importance_list | `int` k |
+| :-: | :-: | :-: |
+| feature vector | importance of each feature | the number of features to be selected |
+
+##### Output
+
+| `array<double>` k-best features |
+| :-: |
+| top-k elements from feature vector `X` based on importance list |
+
+### [UDF] `chi2(observed::array<array<number>>, expected::array<array<number>>)::struct<array<double>, array<double>>`
+
+##### Input
+
+| `array<number>` observed | `array<number>` expected |
+| :-: | :-: |
+| observed features | expected features `dot(class_prob.T, feature_count)` |
+
+Both `observed` and `expected` have shape `(#classes, #features)`.
+
+##### Output
+
+| `struct<array<double>, array<double>>` importance_list |
+| :-: |
+| chi2-value and p-value for each feature |
+
+### [UDAF] `snr(X::array<number>, Y::array<int>)::array<double>`
+
+##### Input
+
+| `array<number>` X | `array<int>` Y |
+| :-: | :-: |
+| feature vector | one hot label |
+
+##### Output
+
+| `array<double>` importance_list |
+| :-: |
+| Signal Noise Ratio for each feature |
+
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/211c2803/docs/gitbook/ft_engineering/vectorization.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/ft_engineering/vectorization.md b/docs/gitbook/ft_engineering/vectorization.md
new file mode 100644
index 0000000..21fcea7
--- /dev/null
+++ b/docs/gitbook/ft_engineering/vectorization.md
@@ -0,0 +1,61 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+-->
+
+## Feature Vectorization
+
+`array<string> vectorize_features(array<string> featureNames, ...)` generates a feature vector for each row of a table.
+
+```sql
+select vectorize_features(array("a","b"),"0.2","0.3") from dual;
+>["a:0.2","b:0.3"]
+
+-- zero-weight features are omitted
+select vectorize_features(array("a","b"),"0.2",0) from dual;
+> ["a:0.2"]
+
+-- a true boolean value is treated as 1.0 (emitted with its column name)
+select vectorize_features(array("a","b","bool"),0.2,0.3,true) from dual;
+> ["a:0.2","b:0.3","bool:1.0"]
+
+-- an example generating feature vectors from a table
+select * from dual;
+> 1
+select vectorize_features(array("a"),*) from dual;
+> ["a:1.0"]
+
+-- with a categorical feature
+select vectorize_features(array("a","b","weather"),"0.2","0.3","sunny") from dual;
+> ["a:0.2","b:0.3","weather#sunny"]
+```
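
The rules illustrated above can be modeled roughly as follows (a hypothetical Python sketch, not Hivemall's implementation: numeric values become `name:value`, non-numeric strings become `name#value`, and zero weights are dropped):

```python
def vectorize_features(names, *values):
    features = []
    for name, v in zip(names, values):
        try:
            w = float(v)                    # numeric (and boolean: True -> 1.0)
            if w != 0.0:                    # zero weights are dropped
                features.append(f"{name}:{w}")
        except (TypeError, ValueError):
            features.append(f"{name}#{v}")  # categorical feature
    return features

print(vectorize_features(["a", "b", "weather"], "0.2", "0.3", "sunny"))
# ['a:0.2', 'b:0.3', 'weather#sunny']
```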
+
+```sql
+select
+ id,
+ vectorize_features(
+ array("age","job","marital","education","default","balance","housing","loan","contact","day","month","duration","campaign","pdays","previous","poutcome"),
+ age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
+ ) as features,
+ y
+from
+ train
+limit 2;
+```
+
+> 1 ["age:39.0","job#blue-collar","marital#married","education#secondary","default#no","balance:1756.0","housing#yes","loan#no","contact#cellular","day:3.0","month#apr","duration:939.0","campaign:1.0","pdays:-1.0","poutcome#unknown"] 1
+> 2 ["age:51.0","job#entrepreneur","marital#married","education#primary","default#no","balance:1443.0","housing#no","loan#no","contact#cellular","day:18.0","month#feb","duration:172.0","campaign:10.0","pdays:-1.0","poutcome#unknown"] 1
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/211c2803/docs/gitbook/ft_engineering/vectorizer.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/ft_engineering/vectorizer.md b/docs/gitbook/ft_engineering/vectorizer.md
deleted file mode 100644
index 59038d1..0000000
--- a/docs/gitbook/ft_engineering/vectorizer.md
+++ /dev/null
@@ -1,61 +0,0 @@
-<!--
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements. See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership. The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied. See the License for the
- specific language governing permissions and limitations
- under the License.
--->
-
-## Feature Vectorizer
-
-`array<string> vectorize_feature(array<string> featureNames, ...)` is useful to generate a feature vector for each row, from a table.
-
-```sql
-select vectorize_features(array("a","b"),"0.2","0.3") from dual;
->["a:0.2","b:0.3"]
-
--- avoid zero weight
-select vectorize_features(array("a","b"),"0.2",0) from dual;
-> ["a:0.2"]
-
--- true boolean value is treated as 1.0 (a categorical value w/ its column name)
-select vectorize_features(array("a","b","bool"),0.2,0.3,true) from dual;
-> ["a:0.2","b:0.3","bool:1.0"]
-
--- an example to generate feature vectors from table
-select * from dual;
-> 1
-select vectorize_features(array("a"),*) from dual;
-> ["a:1.0"]
-
--- has categorical feature
-select vectorize_features(array("a","b","wheather"),"0.2","0.3","sunny") from dual;
-> ["a:0.2","b:0.3","whether#sunny"]
-```
-
-```sql
-select
- id,
- vectorize_features(
- array("age","job","marital","education","default","balance","housing","loan","contact","day","month","duration","campaign","pdays","previous","poutcome"),
- age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
- ) as features,
- y
-from
- train
-limit 2;
-
-> 1 ["age:39.0","job#blue-collar","marital#married","education#secondary","default#no","balance:1756.0","housing#yes","loan#no","contact#cellular","day:3.0","month#apr","duration:939.0","campaign:1.0","pdays:-1.0","poutcome#unknown"] 1
-> 2 ["age:51.0","job#entrepreneur","marital#married","education#primary","default#no","balance:1443.0","housing#no","loan#no","contact#cellular","day:18.0","month#feb","duration:172.0","campaign:10.0","pdays:-1.0","poutcome#unknown"] 1
-```
\ No newline at end of file