Posted to commits@hivemall.apache.org by my...@apache.org on 2016/12/01 05:25:26 UTC

[02/50] [abbrv] incubator-hivemall git commit: Added a gitbook userguide

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/recommend/item_based_cf.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/recommend/item_based_cf.md b/docs/gitbook/recommend/item_based_cf.md
new file mode 100644
index 0000000..a4a8cfd
--- /dev/null
+++ b/docs/gitbook/recommend/item_based_cf.md
@@ -0,0 +1,565 @@
+This document describes how to perform item-based collaborative filtering using Hivemall.
+
+_Caution: naive similarity computation requires `O(n^2)` time to compute all item-item pair similarities. [MinHash](https://en.wikipedia.org/wiki/MinHash#Jaccard_similarity_and_minimum_hash_values) is an efficient scheme for approximating Jaccard similarity. Section 5 shows how to use MinHash in Hivemall._
+
+## 1. Prepare transaction table
+
+Prepare the following transaction table. We will generate a `feature_vector` for each `itemid` based on the cooccurrence of purchased items, a sort of basket analysis.
+
+| userid | itemid | purchased_at `timestamp` |
+|:-:|:-:|:-:|
+| 1 | 31231 | 2015-04-09 00:29:02 |
+| 1 | 13212 | 2016-05-24 16:29:02 |
+| 2 | 312 | 2016-06-03 23:29:02 |
+| 3 | 2313 | 2016-06-04 19:29:02 |
+| .. | .. | .. |
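+
+If you do not have such a table yet, a minimal DDL sketch is shown below (the table and column names follow the queries in this page; the column types and storage options are assumptions to adapt to your environment):
+
+```sql
+-- a minimal sketch of the transaction table assumed throughout this page
+CREATE TABLE IF NOT EXISTS transaction (
+  userid INT,
+  itemid INT,
+  purchased_at TIMESTAMP
+);
+```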
+
+## 2. Create item_features table
+
+To create a feature vector for each item, we first need the following `cooccurrence` relation.
+
+| itemid | other | cnt |
+|:-:|:-:|:-:|
+| 583266 | 621056 | 9999 |
+| 583266 | 583266 | 18 |
+| 31231 | 13212 | 129 |
+| 31231 | 31231 | 3 |
+| 31231	| 9833 | 953 |
+| ... | ... | ... |
+
+The feature vector of each item will be as follows:
+
+| itemid | feature_vector `array<string>` |
+|:-:|:-:|
+| 583266 | 621056:9999, 583266:18 |
+| 31231 | 13212:129, 31231:3, 9833:953 |
+| ... | ... |
+
+Note that the values in the feature vector should be scaled for k-NN similarity computation, e.g., as follows:
+
+| itemid | feature_vector `array<string>` |
+|:-:|:-:|
+| 583266 | 621056:`ln(9999+1)`, 583266:`ln(18+1)` |
+| 31231 | 13212:`ln(129+1)`, 31231:`ln(3+1)`, 9833:`ln(953+1)` |
+| ... | ... |
+
+The following queries create the above tables.
+
+### 2.1. Creating the user_purchased table
+
+The following query creates a table that contains userid, itemid, and purchased_at. The table represents the last user-item contact (purchase) while the `transaction` table holds all contacts.
+
+```sql
+CREATE TABLE user_purchased as
+-- INSERT OVERWRITE TABLE user_purchased
+select 
+  userid,
+  itemid,
+  max(purchased_at) as purchased_at,
+  count(1) as purchase_count
+from
+  transaction
+-- where purchased_at < xxx -- divide training/testing data by time 
+group by
+  userid, itemid
+;
+```
+
+**Note:** _It is better to exclude transactions that are too old because such information tends to be outdated, though a sufficient number of transactions is required for recommendation._
+
+### 2.2. Creating cooccurrence table
+
+**Caution:** _The item-item cooccurrence matrix is a symmetric matrix whose diagonal elements hold the total occurrence count of each item. If the number of items is `k`, then the number of distinct item pairs is `k * (k - 1) / 2`, which is usually very large._
+
+_It is better to use [2.2.2.](https://github.com/myui/hivemall/wiki/Item-based-Collaborative-Filtering#limiting-size-of-elements-in-cooccurrence_upper_triangular) instead of [2.2.1.](https://github.com/myui/hivemall/wiki/Item-based-Collaborative-Filtering#221-create-cooccurrence-table-directly) for creating a `cooccurrence` table when the dataset is large._
+
+### 2.2.1. Create cooccurrence table directly
+
+```sql
+create table cooccurrence as 
+-- INSERT OVERWRITE TABLE cooccurrence
+select
+  u1.itemid,
+  u2.itemid as other, 
+  count(1) as cnt
+from
+  user_purchased u1
+  JOIN user_purchased u2 ON (u1.userid = u2.userid)
+where
+  u1.itemid != u2.itemid 
+  -- AND u2.purchased_at >= u1.purchased_at -- the other item should be purchased with/after the base item
+group by
+  u1.itemid, u2.itemid
+-- having -- optional but recommended to have this condition where dataset is large
+--  cnt >= 2 -- count(1) >= 2
+;
+```
+
+**Caution:** Note that specifying `having cnt >= 2` has the drawback that item cooccurrence is not calculated where `cnt` is less than 2, which could result in no recommended items for certain items. Omit `having cnt >= 2` if the following computations finish in a reasonable time.
+
+**Caution:** _We ignore purchase order in the following example; i.e., the occurrence counts of `ItemA -> ItemB` and `ItemB -> ItemA` are assumed to be the same. This is sometimes not a good idea, e.g., for `Camera -> SD card` vs. `SD card -> Camera`._
+
+### 2.2.2. Create cooccurrence table from Upper Triangular Matrix of cooccurrence
+
+It is better to create an [Upper Triangular Matrix](https://en.wikipedia.org/wiki/Triangular_matrix#Description) that keeps only pairs with `itemid > other` if the resulting table is very large. There is no need to create the Upper Triangular Matrix if your Hadoop cluster can handle the queries in 2.2.1 as they are.
+
+```sql
+create table cooccurrence_upper_triangular as 
+-- INSERT OVERWRITE TABLE cooccurrence_upper_triangular
+select
+  u1.itemid,
+  u2.itemid as other, 
+  count(1) as cnt
+from
+  user_purchased u1
+  JOIN user_purchased u2 ON (u1.userid = u2.userid)
+where
+  u1.itemid > u2.itemid 
+group by
+  u1.itemid, u2.itemid
+;
+```
+
+```sql
+create table cooccurrence as 
+-- INSERT OVERWRITE TABLE cooccurrence
+select * from (
+  select itemid, other, cnt from cooccurrence_upper_triangular
+  UNION ALL
+  select other as itemid, itemid as other, cnt from cooccurrence_upper_triangular
+) t; 
+```
+
+_Note: in Hive, `UNION ALL` is [required to be embedded in a subquery](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union#LanguageManualUnion-UNIONwithinaFROMClause)._
+
+### Limiting size of elements in cooccurrence_upper_triangular
+
+```sql
+create table cooccurrence_upper_triangular as
+WITH t1 as (
+  select
+    u1.itemid,
+    u2.itemid as other, 
+    count(1) as cnt
+  from
+    user_purchased u1
+    JOIN user_purchased u2 ON (u1.userid = u2.userid)
+  where
+    u1.itemid > u2.itemid 
+  group by
+    u1.itemid, u2.itemid
+),
+t2 as (
+  select
+    each_top_k( -- top 1000
+      1000, itemid, cnt, 
+      itemid, other, cnt
+    ) as (rank, cmpkey, itemid, other, cnt)
+  from (
+    select * from t1
+    CLUSTER BY itemid
+  ) t
+)
+-- INSERT OVERWRITE TABLE cooccurrence_upper_triangular
+select itemid, other, cnt
+from t2;
+```
+
+```sql
+create table cooccurrence as 
+WITH t1 as (
+  select itemid, other, cnt from cooccurrence_upper_triangular
+  UNION ALL
+  select other as itemid, itemid as other, cnt from cooccurrence_upper_triangular
+),
+t2 as (
+  select
+    each_top_k(
+      1000, itemid, cnt,
+      itemid, other, cnt
+    ) as (rank, cmpkey, itemid, other, cnt)
+  from (
+    select * from t1
+    CLUSTER BY itemid
+  ) t
+)
+-- INSERT OVERWRITE TABLE cooccurrence
+select itemid, other, cnt
+from t2;
+```
+
+### 2.2.3. Computing cooccurrence ratio (optional step)
+
+You can optionally compute cooccurrence ratio as follows:
+
+```sql
+WITH stats as (
+  select 
+    itemid,
+    sum(cnt) as totalcnt
+  from 
+    cooccurrence
+  group by
+    itemid
+)
+INSERT OVERWRITE TABLE cooccurrence_ratio
+SELECT
+  l.itemid,
+  l.other, 
+  (l.cnt / r.totalcnt) as ratio
+FROM
+  cooccurrence l
+  JOIN stats r ON (l.itemid = r.itemid)
+;
+```
+
+`l.cnt / r.totalcnt` represents a cooccurrence ratio in the range `[0,1]`.
+
+### 2.3. Creating a feature vector for each item
+
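+The `INSERT OVERWRITE` below assumes that the `item_features` table already exists. A minimal DDL sketch (the schema follows the tables shown in section 2; storage options are assumptions) is:
+
+```sql
+-- a minimal sketch; adjust storage options to your environment
+CREATE TABLE IF NOT EXISTS item_features (
+  itemid INT,
+  feature_vector ARRAY<STRING>
+);
+```
+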
+```sql
+INSERT OVERWRITE TABLE item_features
+SELECT
+  itemid,
+  -- scaling `ln(cnt+1)` to avoid large value in the feature vector
+  -- rounding to xxx.yyyyy to reduce size of feature_vector in array<string>
+  collect_list(feature(other, round(ln(cnt+1),5))) as feature_vector
+FROM
+  cooccurrence
+GROUP BY
+  itemid
+;
+```
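+
+If you computed the optional `cooccurrence_ratio` table in section 2.2.3, a variant that builds the feature vectors from the ratio instead of `ln(cnt+1)` might look as follows (the choice of scaling is an assumption; pick whichever works better for your data):
+
+```sql
+-- sketch: feature vectors based on the cooccurrence ratio
+INSERT OVERWRITE TABLE item_features
+SELECT
+  itemid,
+  collect_list(feature(other, round(ratio, 5))) as feature_vector
+FROM
+  cooccurrence_ratio
+GROUP BY
+  itemid
+;
+```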
+
+## 3. Computing Item similarity scores
+
+Item-item similarity computation has computational complexity `O(n^2)`, where `n` is the number of items.
+The optimal approach differs depending on your cluster size and your dataset.
+
+**Note:** _It is better to use the [3.1.1.](https://github.com/myui/hivemall/wiki/Item-based-Collaborative-Filtering#311-similarity-computation-using-the-symmetric-property-of-item-similarity-matrix) scheme when the dataset is large._
+
+### 3.1. Shuffle heavy similarity computation
+
+This version involves a 3-way join with a large data shuffle; however, it works in parallel when the cluster has enough task slots.
+
+```sql
+WITH similarity as (
+  select
+    o.itemid,
+    o.other,
+    cosine_similarity(t1.feature_vector, t2.feature_vector) as similarity
+  from
+    cooccurrence o
+    JOIN item_features t1 ON (o.itemid = t1.itemid)
+    JOIN item_features t2 ON (o.other = t2.itemid)
+),
+topk as (
+  select
+    each_top_k( -- get top-10 items based on similarity score
+      10, itemid, similarity,
+      itemid, other -- output items
+    ) as (rank, similarity, itemid, other)
+  from (
+    select * from similarity
+    where similarity > 0 -- similarity > 0.01
+    CLUSTER BY itemid
+  ) t
+)
+INSERT OVERWRITE TABLE item_similarity
+select 
+  itemid, other, similarity
+from 
+  topk;
+```
+
+### 3.1.1. Similarity computation using the symmetric property of Item similarity matrix
+
+Note that `item_similarity` is a symmetric similarity matrix, so you can compute it from its upper triangular part as follows.
+
+```sql
+WITH cooccurrence_top100 as (
+  select
+    each_top_k(
+      100, itemid, cnt,  
+      itemid, other
+    ) as (rank, cmpkey, itemid, other)
+  from (
+    select * from cooccurrence_upper_triangular
+    CLUSTER BY itemid
+  ) t
+), 
+similarity as (
+  select
+    o.itemid,
+    o.other,
+    cosine_similarity(t1.feature_vector, t2.feature_vector) as similarity
+  from
+    cooccurrence_top100 o
+    -- cooccurrence_upper_triangular o
+    JOIN item_features t1 ON (o.itemid = t1.itemid)
+    JOIN item_features t2 ON (o.other = t2.itemid)
+),
+topk as (
+  select
+    each_top_k( -- get top-10 items based on similarity score
+      10, itemid, similarity,
+      itemid, other -- output items
+    ) as (rank, similarity, itemid, other)
+  from (
+    select * from similarity
+    where similarity > 0 -- similarity > 0.01
+    CLUSTER BY itemid
+  ) t
+)
+INSERT OVERWRITE TABLE item_similarity_upper_triangular
+select 
+  itemid, other, similarity
+from 
+  topk;
+```
+
+```sql
+INSERT OVERWRITE TABLE item_similarity
+select * from (
+  select itemid, other, similarity from item_similarity_upper_triangular
+  UNION ALL
+  select other as itemid, itemid as other, similarity from item_similarity_upper_triangular
+) t;
+```
+
+### 3.2. Computation-heavy similarity computation
+
+Alternatively, you can compute cosine similarity as follows. This version involves a cross join and thus runs sequentially in a single task. However, it involves less shuffle compared to 3.1.
+
+```sql
+WITH similarity as (
+  select
+   t1.itemid,
+   t2.itemid as other,
+   cosine_similarity(t1.feature_vector, t2.feature_vector) as similarity
+  from
+   item_features t1
+   CROSS JOIN item_features t2
+  WHERE
+    t1.itemid != t2.itemid
+),
+topk as (
+  select
+    each_top_k( -- get top-10 items based on similarity score
+      10, itemid, similarity,
+      itemid, other -- output items
+    ) as (rank, similarity, itemid, other)
+  from (
+    select * from similarity
+    where similarity > 0 -- similarity > 0.01
+    CLUSTER BY itemid
+  ) t
+)
+INSERT OVERWRITE TABLE item_similarity
+select 
+  itemid, other, similarity
+from 
+  topk
+;
+```
+
+| itemid | other | similarity |
+|:-:|:-:|:-:|
+| 583266 | 621056 | 0.33 |
+| 583266 | 583266 | 0.18 |
+| 31231 | 13212 | 1.29 |
+| 31231 | 31231 | 0.3 |
+| 31231	| 9833 | 0.953 |
+| ... | ... | ... |
+
+## 4. Item-based Recommendation
+
+This section introduces item-based recommendation based on recently purchased items by each user.
+
+**Caution:** _It is usually better not to recommend items that the user has already purchased only once, while items purchased twice or more (e.g., repeatedly purchased daily necessities) may be included in the recommendation list. For that, you need an item property table indicating whether each item is a repeatedly purchased item or not, as sketched below._
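+
+As a rough sketch, such a table could be derived from the `user_purchased` table of section 2.1 (the table name `item_repeat_property` and the threshold of 2 purchases are assumptions):
+
+```sql
+-- hypothetical helper table: flag items that at least one user purchased two or more times
+create table item_repeat_property as
+select
+  itemid,
+  max(if(purchase_count >= 2, 1, 0)) as repeatedly_purchased
+from
+  user_purchased
+group by
+  itemid;
+```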
+
+### 4.1. Compute top-k recently purchased items for each user
+
+First, prepare `recently_purchased_items` table as follows:
+
+```sql
+INSERT OVERWRITE TABLE recently_purchased_items
+select
+  each_top_k( -- get top-5 recently purchased items for each user
+     5, userid, purchased_at,
+     userid, itemid
+  ) as (rank, purchased_at, userid, itemid)
+from (
+  select
+    purchased_at, userid, itemid
+  from 
+    user_purchased
+  -- where [optional filtering]
+  --  purchased_at >= xxx -- divide training/test data by time
+  CLUSTER BY
+    userid -- Note CLUSTER BY is mandatory when using each_top_k
+) t;
+```
+
+### 4.2. Recommend top-k items based on the cooccurrence for each user's recently purchased item
+
+```sql
+WITH topk as (
+  select
+    each_top_k(
+       5, userid, cnt,
+       userid, other
+    ) as (rank, cnt, userid, rec_item)
+  from (
+    select 
+      t1.userid, t2.other, max(t2.cnt) as cnt
+    from
+      recently_purchased_items t1
+      JOIN cooccurrence t2 ON (t1.itemid = t2.itemid)
+    where
+      t1.itemid != t2.other -- do not include items that user already purchased
+      AND NOT EXISTS (
+        SELECT a.itemid FROM user_purchased a
+        WHERE a.userid = t1.userid AND a.itemid = t2.other
+--        AND a.purchase_count <= 1 -- optional
+      )
+    group by
+      t1.userid, t2.other
+    CLUSTER BY
+      userid -- top-k grouping by userid
+  ) t1
+)
+INSERT OVERWRITE TABLE item_recommendation
+select
+  userid,
+  map_values(to_ordered_map(rank, rec_item)) as rec_items
+from
+  topk
+group by
+  userid
+;
+```
+
+### 4.3. Recommend top-k items based on the (cooccurrence) similarity for each user's recently purchased item
+
+```sql
+WITH topk as (
+  select
+    each_top_k(
+       5, userid, similarity,
+       userid, other
+    ) as (rank, similarity, userid, rec_item)
+  from (
+    select
+      t1.userid, t2.other, max(t2.similarity) as similarity
+    from
+      recently_purchased_items t1
+      JOIN item_similarity t2 ON (t1.itemid = t2.itemid)
+    where
+      t1.itemid != t2.other -- do not include items that user already purchased
+      AND NOT EXISTS (
+        SELECT a.itemid FROM user_purchased a
+        WHERE a.userid = t1.userid AND a.itemid = t2.other
+--        AND a.purchase_count <= 1 -- optional
+      )
+    group by
+      t1.userid, t2.other
+    CLUSTER BY
+      userid -- top-k grouping by userid
+  ) t1
+)
+INSERT OVERWRITE TABLE item_recommendation
+select
+  userid,
+  map_values(to_ordered_map(rank, rec_item)) as rec_items
+from
+  topk
+group by
+  userid
+;
+```
+
+## 5. Pseudo Jaccard Similarity computation using MinHash
+
+Refer to [this article](https://en.wikipedia.org/wiki/MinHash#Jaccard_similarity_and_minimum_hash_values) for details about MinHash and Jaccard similarity. [This blog article](https://blog.treasuredata.com/blog/2016/02/16/minhash-in-hivemall/) also explains Hivemall's minhash.
+
+```sql
+INSERT OVERWRITE TABLE minhash -- results in 30x records of item_features
+select  
+  -- assign 30 minhash values for each item
+  minhash(itemid, feature_vector, "-n 30") as (clusterid, itemid) -- '-n' would be 10~100
+from
+  item_features
+;
+
+WITH t1 as (
+  select
+    l.itemid,
+    r.itemid as other,
+    count(1) / 30 as similarity -- Pseudo jaccard similarity '-n 30'
+  from
+    minhash l 
+    JOIN minhash r 
+      ON (l.clusterid = r.clusterid)
+  where 
+    l.itemid != r.itemid
+  group by
+    l.itemid, r.itemid
+  having
+    count(1) >= 3 -- [optional] equivalent to (count(1)/30) >= 0.1
+),
+top100 as (
+  select
+    each_top_k(100, itemid, similarity, itemid, other)
+      as (rank, similarity, itemid, other)
+  from (
+    select * from t1 
+    -- where similarity >= 0.1 -- Optional filtering. Can be ignored.
+    CLUSTER BY itemid 
+  ) t2
+)
+INSERT OVERWRITE TABLE jaccard_similarity
+select
+  itemid, other, similarity
+from
+  top100
+;
+```
+_Caution: Note that there might be no similar item for certain items._
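+
+To check which items ended up without any similar item, a query like the following can be used (a sketch; it simply anti-joins `item_features` with the result table above):
+
+```sql
+-- items for which MinHash produced no similar item
+select f.itemid
+from
+  item_features f
+  LEFT OUTER JOIN (
+    select distinct itemid from jaccard_similarity
+  ) s ON (f.itemid = s.itemid)
+where
+  s.itemid is null;
+```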
+
+### 5.1. Cosine similarity computation following MinHash-based similar-item filtering
+
+You can compute the `top-k` similar items based on cosine similarity after roughly listing the `top-N` similar items using MinHash, where `k << N` (e.g., k=10 and N=100).
+
+```sql
+WITH similarity as (
+  select
+    o.itemid,
+    o.other,
+    cosine_similarity(t1.feature_vector, t2.feature_vector) as similarity
+  from
+    jaccard_similarity o
+    JOIN item_features t1 ON (o.itemid = t1.itemid)
+    JOIN item_features t2 ON (o.other = t2.itemid)
+),
+topk as (
+  select
+    each_top_k( -- get top-10 items based on similarity score
+      10, itemid, similarity,
+      itemid, other -- output items
+    ) as (rank, similarity, itemid, other)
+  from (
+    select * from similarity
+    where similarity > 0 -- similarity > 0.01
+    CLUSTER BY itemid
+  ) t
+)
+INSERT OVERWRITE TABLE cosine_similarity
+select 
+  itemid, other, similarity
+from 
+  topk;
+```
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/recommend/movielens.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/recommend/movielens.md b/docs/gitbook/recommend/movielens.md
new file mode 100644
index 0000000..e69de29

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/recommend/movielens_cv.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/recommend/movielens_cv.md b/docs/gitbook/recommend/movielens_cv.md
new file mode 100644
index 0000000..ec2255b
--- /dev/null
+++ b/docs/gitbook/recommend/movielens_cv.md
@@ -0,0 +1,63 @@
+[Cross-validation](http://en.wikipedia.org/wiki/Cross-validation_(statistics)) is a model validation technique for assessing how a prediction model will generalize to an independent data set. This example shows a way to perform [k-fold cross validation](http://en.wikipedia.org/wiki/Cross-validation_(statistics)#k-fold_cross-validation) to evaluate prediction performance.
+
+*Caution:* Matrix factorization is supported in Hivemall v0.3 or later.
+
+# Dataset creation for 10-fold cross validation
+```sql
+use movielens;
+
+set hivevar:kfold=10;
+set hivevar:seed=31;
+
+-- Adding group id (gid) to each training instance
+drop table ratings_groupded;
+create table ratings_groupded
+as
+select
+  rand_gid2(${kfold}, ${seed}) gid, -- generates group id ranging from 1 to 10
+  userid, 
+  movieid, 
+  rating
+from
+  ratings
+cluster by gid, rand(${seed});
+```
+
+## Set training hyperparameters
+
+```sql
+-- latent factors
+set hivevar:factor=10;
+-- maximum number of iterations
+set hivevar:iters=50;
+-- regularization parameter
+set hivevar:lambda=0.05;
+-- learning rate
+set hivevar:eta=0.005;
+-- convergence rate (if the change between iterations becomes less than or equal to ${cv_rate}, training stops)
+set hivevar:cv_rate=0.001;
+```
+_Due to [a bug](https://issues.apache.org/jira/browse/HIVE-8396) in Hive, do not issue comments in CLI._
+
+```sql
+select avg(rating) from ratings;
+```
+> 3.581564453029317
+
+```sql
+-- mean rating value (Optional but recommended to set ${mu})
+set hivevar:mu=3.581564453029317;
+```
+_Note that it is not necessary to set an exact value for ${mu}._
+
+## SQL-generation for 10-folds cross validation
+
+Run [generate_cv.sh](https://gist.github.com/myui/c2009e5791cca650a4d0) and create [generate_cv.sql](https://gist.github.com/myui/2e2018217e2188222655).
+
+Then, issue the SQL queries in [generate_cv.sql](https://gist.github.com/myui/2e2018217e2188222655) to get MAE/RMSE.
+
+> 0.6695442192077673 (MAE)
+
+> 0.8502739040257945 (RMSE)
+
+_We recommend using [Tez](http://tez.apache.org/) for running queries having many stages._
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/recommend/movielens_dataset.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/recommend/movielens_dataset.md b/docs/gitbook/recommend/movielens_dataset.md
new file mode 100644
index 0000000..f175f4f
--- /dev/null
+++ b/docs/gitbook/recommend/movielens_dataset.md
@@ -0,0 +1,160 @@
+# Data preparation
+
+First, download the MovieLens dataset from the following site.
+> http://www.grouplens.org/system/files/ml-1m.zip
+
+See the README for details about the dataset.
+> http://files.grouplens.org/papers/ml-1m-README.txt
+
+You can find three .dat files in the archive:
+> movies.dat, ratings.dat, users.dat.
+
+Change column separator as follows:
+```sh
+sed 's/::/#/g' movies.dat > movies.t
+sed 's/::/#/g' ratings.dat > ratings.t
+sed 's/::/#/g' users.dat > users.t
+```
+
+Create a file named occupations.t with the following contents:
+```
+0#other/not specified
+1#academic/educator
+2#artist
+3#clerical/admin
+4#college/grad student
+5#customer service
+6#doctor/health care
+7#executive/managerial
+8#farmer
+9#homemaker
+10#K-12 student
+11#lawyer
+12#programmer
+13#retired
+14#sales/marketing
+15#scientist
+16#self-employed
+17#technician/engineer
+18#tradesman/craftsman
+19#unemployed
+20#writer
+```
+
+# Importing data as Hive tables
+
+## create tables
+```sql
+create database movielens;
+use movielens;
+
+CREATE EXTERNAL TABLE ratings (
+  userid INT, 
+  movieid INT,
+  rating INT, 
+  tstamp STRING
+) ROW FORMAT DELIMITED
+FIELDS TERMINATED BY '#'
+STORED AS TEXTFILE
+LOCATION '/dataset/movielens/ratings';
+
+CREATE EXTERNAL TABLE movies (
+  movieid INT, 
+  title STRING,
+  genres ARRAY<STRING>
+) ROW FORMAT DELIMITED
+FIELDS TERMINATED BY '#'
+COLLECTION ITEMS TERMINATED BY "|"
+STORED AS TEXTFILE
+LOCATION '/dataset/movielens/movies';
+
+CREATE EXTERNAL TABLE users (
+  userid INT, 
+  gender STRING, 
+  age INT,
+  occupation INT,
+  zipcode STRING
+) ROW FORMAT DELIMITED
+FIELDS TERMINATED BY '#'
+STORED AS TEXTFILE
+LOCATION '/dataset/movielens/users';
+
+CREATE EXTERNAL TABLE occupations (
+  id INT,
+  occupation STRING
+) ROW FORMAT DELIMITED
+FIELDS TERMINATED BY '#'
+STORED AS TEXTFILE
+LOCATION '/dataset/movielens/occupations';
+```
+
+## load data into tables
+```sh
+hadoop fs -put ratings.t /dataset/movielens/ratings
+hadoop fs -put movies.t /dataset/movielens/movies
+hadoop fs -put users.t /dataset/movielens/users
+hadoop fs -put occupations.t /dataset/movielens/occupations
+```
+
+# Create a concatenated table 
+```sql
+CREATE TABLE rating_full
+as
+select 
+  r.*, 
+  m.title as m_title,
+  concat_ws('|',sort_array(m.genres)) as m_genres, 
+  u.gender as u_gender,
+  u.age as u_age,
+  u.occupation as u_occupation,
+  u.zipcode as u_zipcode
+from
+  ratings r 
+  JOIN movies m ON (r.movieid = m.movieid)
+  JOIN users u ON (r.userid = u.userid);
+```
+
+```
+hive> desc rating_full;
+userid                  int                     None
+movieid                 int                     None
+rating                  int                     None
+tstamp                  string                  None
+m_title                 string                  None
+m_genres                string                  None
+u_gender                string                  None
+u_age                   int                     None
+u_occupation            int                     None
+u_zipcode               string                  None
+```
+
+---
+# Creating training/testing data
+
+Create a training/testing table such that each has 80%/20% of the original rating data.
+
+```sql
+-- Adding rowids to the rating table
+SET hivevar:seed=31;
+CREATE TABLE ratings2
+as
+select
+  rand(${seed}) as rnd, 
+  userid, 
+  movieid, 
+  rating
+from 
+  ratings;
+
+CREATE TABLE training
+as
+select * from ratings2
+order by rnd DESC
+limit 800000;
+
+CREATE TABLE testing
+as
+select * from ratings2
+order by rnd ASC
+limit 200209;
+```

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/recommend/movielens_fm.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/recommend/movielens_fm.md b/docs/gitbook/recommend/movielens_fm.md
new file mode 100644
index 0000000..282e923
--- /dev/null
+++ b/docs/gitbook/recommend/movielens_fm.md
@@ -0,0 +1,249 @@
+_Caution: Factorization Machines are supported in Hivemall v0.4 or later._
+
+# Data preparation
+
+First of all, please create the `ratings` table described in the following page:
+https://github.com/myui/hivemall/wiki/MovieLens-Dataset
+
+```sql
+use movielens;
+
+SET hivevar:seed=31;
+
+DROP TABLE ratings_fm;
+CREATE TABLE ratings_fm
+as
+select
+	rowid() as rowid,
+	categorical_features(array("userid","movieid"), userid, movieid) 
+	  as features,
+	rating,
+	rand(${seed}) as rnd
+from
+	ratings
+CLUSTER BY rand(43); -- shuffle training input
+
+select * from ratings_fm limit 2;
+```
+
+| rowid | features | rating | rnd |
+|:-----:|:--------:|:------:|:---:|
+| 1-383970 | ["userid#2244","movieid#1272"] | 5 | 0.33947035987020546 |
+| 1-557913 | ["userid#3425","movieid#2791"] | 4 | 0.12344886396954391 |
+
+```sql
+-- use 80% for training
+DROP TABLE training_fm;
+CREATE TABLE training_fm
+as
+select * from ratings_fm
+order by rnd DESC
+limit 800000;
+
+-- use 20% for testing
+DROP TABLE testing_fm;
+CREATE TABLE testing_fm
+as
+select * from ratings_fm
+order by rnd ASC
+limit 200209;
+
+-- testing table for prediction
+CREATE OR REPLACE VIEW testing_fm_exploded
+as 
+select 
+  rowid,
+  extract_feature(fv) as feature,
+  extract_weight(fv) as Xi,
+  rating
+from
+  testing_fm t1 LATERAL VIEW explode(add_bias(features)) t2 as fv;
+```
+_Caution: Don't forget to call `add_bias` in the above query. There is no need to call `add_bias` when preparing training data because Factorization Machines always consider the bias term._
+
+# Training
+
+## Hyperparameters for Training
+```sql
+-- number of factors
+set hivevar:factor=10;
+-- maximum number of training iterations
+set hivevar:iters=50;
+```
+
+## Build a prediction model with Factorization Machines
+
+```sql
+drop table fm_model;
+create table fm_model
+as
+select
+  feature,
+  avg(Wi) as Wi,
+  array_avg(Vif) as Vif
+from (
+  select 
+    train_fm(features, rating, "-factor ${factor} -iters ${iters} -eta 0.01") 
+    	as (feature, Wi, Vif)
+  from 
+    training_fm
+) t
+group by feature;
+```
+
+_Note: the `-eta` option is optional; however, setting `-eta 0.01` usually works well._
+
+## Usage of `train_fm`
+
+You can see the usage of `train_fm` by giving the `-help` option as follows:
+```sql
+select 
+  train_fm(features, rating, "-help") as (feature, Wi, Vif)
+from 
+  training_fm
+```
+
+```
+usage: train_fm(array<string> x, double y [, const string options]) -
+       Returns a prediction value [-adareg] [-c] [-cv_rate <arg>]
+       [-disable_cv] [-eta <arg>] [-eta0 <arg>] [-f <arg>] [-help]
+       [-init_v <arg>] [-int_feature] [-iters <arg>] [-lambda <arg>] [-max
+       <arg>] [-maxval <arg>] [-min <arg>] [-min_init_stddev <arg>] [-p
+       <arg>] [-power_t <arg>] [-seed <arg>] [-sigma <arg>] [-t <arg>]
+       [-va_ratio <arg>] [-va_threshold <arg>]
+ -adareg,--adaptive_regularizaion             Whether to enable adaptive
+                                              regularization [default:
+                                              OFF]
+ -c,--classification                          Act as classification
+ -cv_rate,--convergence_rate <arg>            Threshold to determine
+                                              convergence [default: 0.005]
+ -disable_cv,--disable_cvtest                 Whether to disable
+                                              convergence check [default:
+                                              OFF]
+ -eta <arg>                                   The initial learning rate
+ -eta0 <arg>                                  The initial learning rate
+                                              [default 0.1]
+ -f,--factor <arg>                            The number of the latent
+                                              variables [default: 10]
+ -help                                        Show function help
+ -init_v <arg>                                Initialization strategy of
+                                              matrix V [random, gaussian]
+                                              (default: random)
+ -int_feature,--feature_as_integer            Parse a feature as integer
+                                              [default: OFF, ON if -p
+                                              option is specified]
+ -iters,--iterations <arg>                    The number of iterations
+                                              [default: 1]
+ -lambda,--lambda0 <arg>                      The initial lambda value for
+                                              regularization [default:
+                                              0.01]
+ -max,--max_target <arg>                      The maximum value of target
+                                              variable
+ -maxval,--max_init_value <arg>               The maximum initial value in
+                                              the matrix V [default: 1.0]
+ -min,--min_target <arg>                      The minimum value of target
+                                              variable
+ -min_init_stddev <arg>                       The minimum standard
+                                              deviation of initial matrix
+                                              V [default: 0.1]
+ -p,--size_x <arg>                            The size of x
+ -power_t <arg>                               The exponent for inverse
+                                              scaling learning rate
+                                              [default 0.1]
+ -seed <arg>                                  Seed value [default: -1
+                                              (random)]
+ -sigma <arg>                                 The standard deviation for
+                                              initializing V [default:
+                                              0.1]
+ -t,--total_steps <arg>                       The total number of training
+                                              examples
+ -va_ratio,--validation_ratio <arg>           Ratio of training data used
+                                              for validation [default:
+                                              0.05f]
+ -va_threshold,--validation_threshold <arg>   Threshold to start
+                                              validation. At least N
+                                              training examples are used
+                                              before validation [default:
+                                              1000]
+```
+
+# Prediction
+
+```sql
+-- workaround for a bug 
+-- https://github.com/myui/hivemall/wiki/Map-side-Join-causes-ClassCastException-on-Tez:-LazyBinaryArray-cannot-be-cast-to-%5BLjava.lang.Object;
+set hive.mapjoin.optimized.hashtable=false;
+
+drop table fm_predict;
+create table fm_predict
+as
+select
+  t1.rowid,
+  fm_predict(p1.Wi, p1.Vif, t1.Xi) as predicted
+from 
+  testing_fm_exploded t1
+  LEFT OUTER JOIN fm_model p1 ON (t1.feature = p1.feature)
+group by
+  t1.rowid;
+```
+
+# Evaluation
+
+```sql
+select
+  mae(p.predicted, rating) as mae,
+  rmse(p.predicted, rating) as rmse
+from
+  testing_fm as t
+  JOIN fm_predict as p on (t.rowid = p.rowid);
+```
+
+> 0.6736798239047873 (mae)     0.858938110314545 (rmse)
+
+# Fast Factorization Machines Training using Int Features
+
+Training of Factorization Machines (FM) can be done more efficiently, in terms of speed, by using INT features.
+In this section, we show how to run FM training with int features, more specifically by using [feature hashing](https://github.com/myui/hivemall/wiki/Feature-hashing).
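+
+To see what feature hashing does to the input, you can inspect a few rows before training; feature names are replaced by hashed integer indices (output omitted here):
+
+```sql
+-- compare original categorical features with their hashed counterparts
+select features, feature_hashing(features) as hashed from training_fm limit 2;
+```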
+
+```sql
+set hivevar:factor=10;
+set hivevar:iters=50;
+
+drop table fm_model;
+create table fm_model
+as
+select
+  feature,
+  avg(Wi) as Wi,
+  array_avg(Vif) as Vif
+from (
+  select 
+    train_fm(feature_hashing(features), rating, "-factor ${factor} -iters ${iters} -eta 0.01 -int_feature")  -- internally use a sparse map
+ -- train_fm(feature_hashing(features), rating, "-factor ${factor} -iters ${iters} -eta 0.01 -int_feature -num_features 16777216") -- internally use a dense array 
+        as (feature, Wi, Vif)
+  from 
+    training_fm
+) t
+group by feature;
+```
+
+```sql
+set hive.mapjoin.optimized.hashtable=false; -- workaround for https://issues.apache.org/jira/browse/HIVE-11051
+
+WITH predicted as (
+  select
+    t1.rowid,
+    fm_predict(p1.Wi, p1.Vif, t1.Xi) as predicted
+  from 
+    testing_fm_exploded t1
+    LEFT OUTER JOIN fm_model p1 ON (feature_hashing(t1.feature) = p1.feature)
+  group by
+    t1.rowid
+)
+select
+  mae(p.predicted, rating) as mae,
+  rmse(p.predicted, rating) as rmse
+from
+  testing_fm as t
+  JOIN predicted as p on (t.rowid = p.rowid);
+```
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/recommend/movielens_mf.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/recommend/movielens_mf.md b/docs/gitbook/recommend/movielens_mf.md
new file mode 100644
index 0000000..bc5c641
--- /dev/null
+++ b/docs/gitbook/recommend/movielens_mf.md
@@ -0,0 +1,137 @@
+This page explains how to run matrix factorization on the [MovieLens 1M dataset](https://github.com/myui/hivemall/wiki/MovieLens-Dataset).
+
+*Caution:* Matrix factorization is supported in Hivemall v0.3 or later.
+
+## Calculate the mean rating in the training dataset
+```sql
+use movielens;
+
+select avg(rating) from training;
+```
+> 3.593565
+
+## Set variables (hyperparameters) for training
+```sql
+-- mean rating
+set hivevar:mu=3.593565;
+-- number of factors
+set hivevar:factor=10;
+-- maximum number of training iterations
+set hivevar:iters=50;
+```
+See [this article](https://github.com/myui/hivemall/wiki/List-of-parameters-of-Matrix-Factorization) or [OnlineMatrixFactorizationUDTF#getOption()](https://github.com/myui/hivemall/blob/master/src/main/java/hivemall/mf/OnlineMatrixFactorizationUDTF.java#L123) for details of the options.
+
+Note that there is no need to set an exact value for ${mu}. Training actually works without setting ${mu}, but setting it is recommended for better prediction.
+
+_Due to [a bug](https://issues.apache.org/jira/browse/HIVE-8396) in Hive, do not issue comments in CLI._
+
+## Training
+```sql
+create table sgd_model
+as
+select
+  idx, 
+  array_avg(u_rank) as Pu, 
+  array_avg(m_rank) as Qi, 
+  avg(u_bias) as Bu, 
+  avg(m_bias) as Bi
+from (
+  select 
+    train_mf_sgd(userid, movieid, rating, "-factor ${factor} -mu ${mu} -iter ${iters}") as (idx, u_rank, m_rank, u_bias, m_bias)
+  from 
+    training
+) t
+group by idx;
+```
+Note: Hivemall also provides *train_mf_adagrad* for training using AdaGrad.
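+
+As a sketch, the training query above can be adapted by swapping the UDTF (this assumes `train_mf_adagrad` accepts the same arguments and emits the same columns as `train_mf_sgd`):
+
+```sql
+-- sketch: AdaGrad-based matrix factorization model (argument compatibility is an assumption)
+create table adagrad_model
+as
+select
+  idx,
+  array_avg(u_rank) as Pu,
+  array_avg(m_rank) as Qi,
+  avg(u_bias) as Bu,
+  avg(m_bias) as Bi
+from (
+  select
+    train_mf_adagrad(userid, movieid, rating, "-factor ${factor} -mu ${mu} -iter ${iters}") as (idx, u_rank, m_rank, u_bias, m_bias)
+  from
+    training
+) t
+group by idx;
+```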
+
+# Predict
+
+```sql
+select
+  t2.actual,
+  mf_predict(t2.Pu, p2.Qi, t2.Bu, p2.Bi, ${mu}) as predicted
+from (
+  select
+    t1.userid, 
+    t1.movieid,
+    t1.rating as actual,
+    p1.Pu,
+    p1.Bu
+  from
+    testing t1 LEFT OUTER JOIN sgd_model p1
+    ON (t1.userid = p1.idx) 
+) t2 
+LEFT OUTER JOIN sgd_model p2
+ON (t2.movieid = p2.idx);
+```
+
+# Evaluate (computes MAE and RMSE)
+```sql
+select
+  mae(predicted, actual) as mae,
+  rmse(predicted, actual) as rmse
+from (
+  select
+    t2.actual,
+    mf_predict(t2.Pu, p2.Qi, t2.Bu, p2.Bi, ${mu}) as predicted
+  from (
+    select
+      t1.userid, 
+      t1.movieid,
+      t1.rating as actual,
+      p1.Pu,
+      p1.Bu
+    from
+      testing t1 LEFT OUTER JOIN sgd_model p1
+      ON (t1.userid = p1.idx) 
+  ) t2 
+  LEFT OUTER JOIN sgd_model p2
+  ON (t2.movieid = p2.idx)
+) t;
+```
+> 0.6728969407733578 (MAE) 
+
+> 0.8584162122694449 (RMSE)
+
+# Item Recommendation
+
+Recommend top-k movies that a user has not seen yet.
+```sql
+set hivevar:userid=1;
+set hivevar:topk=5;
+
+select
+  t1.movieid, 
+  mf_predict(t2.Pu, t1.Qi, t2.Bu, t1.Bi, ${mu}) as predicted
+from (
+  select
+    idx movieid,
+    Qi, 
+    Bi
+  from
+    sgd_model p
+  where
+    p.idx NOT IN 
+      (select movieid from training where userid=${userid})
+) t1 CROSS JOIN (
+  select
+    Pu,
+    Bu
+  from 
+    sgd_model
+  where
+    idx = ${userid}
+) t2
+order by
+  predicted DESC
+limit ${topk};
+```
+
+| movieid | predicted |
+|--------:|----------:|
+| 318     | 4.8051853 |
+| 2503    | 4.788541  |
+| 53      | 4.7518783 |
+| 904     | 4.7463417 |
+| 953     | 4.732769  |
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/recommend/news20.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/recommend/news20.md b/docs/gitbook/recommend/news20.md
new file mode 100644
index 0000000..e69de29

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/recommend/news20_bbit_minhash.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/recommend/news20_bbit_minhash.md b/docs/gitbook/recommend/news20_bbit_minhash.md
new file mode 100644
index 0000000..72ba7f2
--- /dev/null
+++ b/docs/gitbook/recommend/news20_bbit_minhash.md
@@ -0,0 +1,50 @@
+# Function Signature of bbit_minhash
+
+```
+Text bbit_minhash(array<int|string> features)
+Text bbit_minhash(array<int|string> features, int numHashes=128)
+Text bbit_minhash(array<int|string> features, boolean discardWeight=false)
+Text bbit_minhash(array<int|string> features, int numHashes=128, boolean discardWeight=false)
+```
+
+# Create a signature for each article
+
+```sql
+create table new20mc_with_signature
+as
+select
+  rowid, 
+  bbit_minhash(features, false) as signature
+from
+  news20mc_train;
+```
+
+# kNN brute-force search using b-Bit minhash
+```sql
+set hivevar:topn=10;
+
+select
+  t1.rowid, 
+  jaccard_similarity(t1.signature, q1.signature,128) as similarity
+--  , popcnt(t1.signature, q1.signature) as popcnt
+from
+  new20mc_with_signature t1 
+  CROSS JOIN 
+  (select bbit_minhash(features,128,false) as signature from news20mc_test where rowid = 1) q1
+order by
+  similarity DESC
+limit ${topn};
+```
+
+|rowid  | similarity | popcnt |
+|:------|------------|-------:|
+| 11952 | 0.390625   | 41 |
+| 10748 | 0.359375   | 41 |
+| 12902 | 0.34375    | 45 |
+| 3087  | 0.328125   | 48 |
+| 3     | 0.328125   | 37 |
+| 11493 | 0.328125   | 38 |
+| 3839  | 0.328125   | 41 |
+| 12669 | 0.328125   | 37 |
+| 13604 | 0.3125     | 41 |
+| 6333  | 0.3125     | 39 |
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/recommend/news20_jaccard.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/recommend/news20_jaccard.md b/docs/gitbook/recommend/news20_jaccard.md
new file mode 100644
index 0000000..ea6f8cc
--- /dev/null
+++ b/docs/gitbook/recommend/news20_jaccard.md
@@ -0,0 +1,123 @@
+List related (similar) articles for each article.
+
+# Preparation
+```sql
+use news20;
+
+delete jar /home/myui/tmp/hivemall.jar;
+add jar /home/myui/tmp/hivemall.jar;
+
+source /home/myui/tmp/define-all.hive;
+```
+
+# Extract clusters
+```sql
+set hivevar:hashes=100; -- Generate N sets of minhash values for each row (DEFAULT: 5)
+set hivevar:keygroups=2; -- Use K minhash values for generating a resulting value (DEFAULT: 2)
+
+create table news20_clusterid_assign
+as
+select 
+  -- minhash(rowid, features) as (clusterId, rowid)
+  minhash(rowid, features, "-n ${hashes} -k ${keygroups}") as (clusterId, rowid)
+from 
+  news20mc_train;
+
+--set hivevar:min_cluster_size=5;
+
+create or replace view news20_cluster
+as
+select
+  clusterId, 
+  collect_set(rowid) as rowids
+from 
+  news20_clusterid_assign
+group by clusterId
+-- having size(rowids) > ${min_cluster_size}
+;
+```
+
+# Get recommendations
+
+```sql
+create table news20_similar_articles
+as
+WITH t1 as (
+select
+  l.rowid,
+  r.rowid as other_id,
+  count(1) as cnt
+from
+  news20_clusterid_assign l 
+  LEFT OUTER JOIN
+  news20_clusterid_assign r
+     ON (l.clusterid = r.clusterid)
+where 
+  l.rowid != r.rowid
+group by
+  l.rowid, r.rowid
+having 
+-- 10/${hashes} = 10/100 = 0.1 (keep pairs whose pseudo Jaccard similarity by MinHash is greater than or equal to 0.1)
+  cnt >= 10 
+)
+select
+  rowid,
+  collect_set(other_id) as related_articles
+from 
+  t1
+group by
+  rowid
+-- order by rowid asc
+;
+```
+
+List all possible clusters w/o using a similarity threshold:
+```sql
+create table news20_similar_articles2
+as
+select
+  l.rowid,
+  collect_set(r.rowid) as related_articles
+from
+  news20_clusterid_assign l 
+  LEFT OUTER JOIN
+  news20_clusterid_assign r
+     ON (l.clusterid = r.clusterid)
+where 
+  l.rowid != r.rowid
+group by
+  l.rowid
+-- order by rowid asc
+;
+```
+
+## Jaccard similarity computation using k-Minhash
+
+```sql
+create table news20_jaccard_similarity
+as
+WITH t1 as (
+select
+  l.rowid,
+  r.rowid as other_id,
+  count(1) / ${hashes} as similarity
+from
+  news20_clusterid_assign l 
+  JOIN news20_clusterid_assign r
+     ON (l.clusterid = r.clusterid)
+where 
+  l.rowid != r.rowid
+group by
+  l.rowid, r.rowid
+)
+select
+  rowid,
+  other_id,
+  similarity,
+  1.0 - similarity as distance
+from
+  t1
+where
+  similarity >= 0.1
+;
+```
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/recommend/news20_knn.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/recommend/news20_knn.md b/docs/gitbook/recommend/news20_knn.md
new file mode 100644
index 0000000..0471a2e
--- /dev/null
+++ b/docs/gitbook/recommend/news20_knn.md
@@ -0,0 +1,103 @@
+# Extract clusters and assign N cluster IDs to each article
+```sql
+create or replace view news20_cluster
+as
+select 
+  minhash(rowid, features) as (clusterId, rowid)
+from 
+  news20mc_train;
+
+create table news20_with_clusterid
+as
+select 
+  t1.clusterid, 
+  t1.rowid, 
+  o1.features
+from 
+  news20_cluster t1
+  JOIN news20mc_train o1 ON (t1.rowid = o1.rowid);
+```
+
+# Query expression with cluster id
+```sql
+set hivevar:noWeight=false;
+
+create table extract_target_cluster
+as
+select 
+  features,
+  clusterid
+from (
+  select 
+     features,
+     minhashes(features,${noWeight}) as clusters
+  from 
+     news20mc_test 
+  where 
+     rowid = 1
+) t1
+LATERAL VIEW explode(clusters) t2 AS clusterid;
+```
+
+# kNN search using minhashing
+```sql
+set hivevar:topn=10;
+
+select 
+  t1.rowid, 
+  cosine_similarity(t1.features, q1.features, false) as similarity
+from
+  news20_with_clusterid t1
+  JOIN extract_target_cluster q1 ON (t1.clusterid = q1.clusterid)
+order by
+  similarity DESC
+limit ${topn};
+```
+
+> Time taken: 22.161 seconds
+
+|rowid  | similarity |
+|:------|-----------:|
+| 2182  | 0.21697778  |
+| 5622  | 0.21483186  |
+| 962   |  0.13240485 |
+| 12242 |  0.12158953 |
+| 5102  | 0.11168713  |
+| 8562  | 0.107470974 |
+| 14396 |0.09949879   |
+| 2542  | 0.09011547  |
+| 1645  | 0.08894014  |
+| 2862  | 0.08800333  |
+
+# Brute force kNN search (based on cosine similarity)
+```sql
+select
+  t1.rowid,
+  cosine_similarity(t1.features, q1.features) as similarity -- hivemall v0.3.2 or later
+  -- cosine_similarity(t1.features, q1.features, false) as similarity -- hivemall v0.3.1 or before
+from 
+  news20mc_train t1
+  CROSS JOIN
+  (select features from news20mc_test where rowid = 1) q1
+ORDER BY
+  similarity DESC
+limit ${topn};
+```
+
+> Time taken: 24.335 seconds
+
+|rowid  | similarity |
+|:------|-----------:|
+| 12902 | 0.47759432 |
+| 7922  | 0.4184913  |
+| 2382  | 0.21919869 |
+| 2182  | 0.21697778 |
+| 5622  | 0.21483186 |
+| 9562  | 0.21223815 |
+| 3222  | 0.164399   |
+| 11202 | 0.16439897 |
+| 10122 | 0.1620197  |
+| 8482  | 0.15229382 |
+
+
+Refer to [this page](https://github.com/myui/hivemall/wiki/Efficient-Top-k-computation-on-Apache-Hive-using-Hivemall-UDTF#top-k-similarity-computation) for efficient top-k kNN computation.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/regression/e2006.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/regression/e2006.md b/docs/gitbook/regression/e2006.md
new file mode 100644
index 0000000..e69de29

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/regression/e2006_arow.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/regression/e2006_arow.md b/docs/gitbook/regression/e2006_arow.md
new file mode 100644
index 0000000..a3b60eb
--- /dev/null
+++ b/docs/gitbook/regression/e2006_arow.md
@@ -0,0 +1,259 @@
+http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-tfidf
+
+---
+#[PA1a]
+
+##Training
+```sql
+set mapred.reduce.tasks=64;
+drop table e2006tfidf_pa1a_model ;
+create table e2006tfidf_pa1a_model as
+select 
+ feature,
+ avg(weight) as weight
+from 
+ (select 
+     train_pa1a_regr(addBias(features),target) as (feature,weight)
+  from 
+     e2006tfidf_train_x3
+ ) t 
+group by feature;
+set mapred.reduce.tasks=-1;
+```
+_Caution: Do not use voted_avg() for regression. voted_avg() is for classification._
+
+## prediction
+```sql
+create or replace view e2006tfidf_pa1a_predict
+as
+select
+  t.rowid, 
+  sum(m.weight * t.value) as predicted
+from 
+  e2006tfidf_test_exploded t LEFT OUTER JOIN
+  e2006tfidf_pa1a_model m ON (t.feature = m.feature)
+group by
+  t.rowid;
+```
+
+## evaluation
+```sql
+drop table e2006tfidf_pa1a_submit;
+create table e2006tfidf_pa1a_submit as
+select 
+  t.target as actual, 
+  p.predicted as predicted
+from 
+  e2006tfidf_test t JOIN e2006tfidf_pa1a_predict p 
+    on (t.rowid = p.rowid);
+
+select avg(actual), avg(predicted) from e2006tfidf_pa1a_submit;
+```
+> -3.8200363760415414     -3.8869923258589476
+
+```sql
+set hivevar:mean_actual=-3.8200363760415414;
+
+select 
+   sqrt(sum(pow(predicted - actual,2.0))/count(1)) as RMSE, 
+   sum(pow(predicted - actual,2.0))/count(1) as MSE, 
+   sum(abs(predicted - actual))/count(1) as MAE,
+   1 - sum(pow(actual - predicted,2.0)) / sum(pow(actual - ${mean_actual},2.0)) as R2
+from 
+   e2006tfidf_pa1a_submit;
+```
+> 0.3797959864675519      0.14424499133686086     0.23846059576113587     0.5010367946980386
+
+---
+#[PA2a]
+
+##Training
+```sql
+set mapred.reduce.tasks=64;
+drop table e2006tfidf_pa2a_model;
+create table e2006tfidf_pa2a_model as
+select 
+ feature,
+ avg(weight) as weight
+from 
+ (select 
+     train_pa2a_regr(addBias(features),target) as (feature,weight)
+  from 
+     e2006tfidf_train_x3
+ ) t 
+group by feature;
+set mapred.reduce.tasks=-1;
+```
+
+## prediction
+```sql
+create or replace view e2006tfidf_pa2a_predict
+as
+select
+  t.rowid, 
+  sum(m.weight * t.value) as predicted
+from 
+  e2006tfidf_test_exploded t LEFT OUTER JOIN
+  e2006tfidf_pa2a_model m ON (t.feature = m.feature)
+group by
+  t.rowid;
+```
+
+## evaluation
+```sql
+drop table e2006tfidf_pa2a_submit;
+create table e2006tfidf_pa2a_submit as
+select 
+  t.target as actual, 
+  pd.predicted as predicted
+from 
+  e2006tfidf_test t JOIN e2006tfidf_pa2a_predict pd 
+    on (t.rowid = pd.rowid);
+
+select avg(actual), avg(predicted) from e2006tfidf_pa2a_submit;
+```
+> -3.8200363760415414     -3.9124877451612488
+
+```sql
+set hivevar:mean_actual=-3.8200363760415414;
+
+select 
+   sqrt(sum(pow(predicted - actual,2.0))/count(1)) as RMSE, 
+   sum(pow(predicted - actual,2.0))/count(1) as MSE, 
+   sum(abs(predicted - actual))/count(1) as MAE,
+   1 - sum(pow(actual - predicted,2.0)) / sum(pow(actual - ${mean_actual},2.0)) as R2
+from 
+   e2006tfidf_pa2a_submit;
+```
+> 0.38538660838804495     0.14852283792484033     0.2466732002711477      0.48623913673053565
+
+---
+#[AROW]
+
+##Training
+```sql
+set mapred.reduce.tasks=64;
+drop table e2006tfidf_arow_model ;
+create table e2006tfidf_arow_model as
+select 
+ feature,
+ -- avg(weight) as weight -- [hivemall v0.1]
+ argmin_kld(weight, covar) as weight -- [hivemall v0.2 or later]
+from 
+ (select 
+     -- train_arow_regr(addBias(features),target) as (feature,weight)    -- [hivemall v0.1]
+     train_arow_regr(addBias(features),target) as (feature,weight,covar) -- [hivemall v0.2 or later]
+  from 
+     e2006tfidf_train_x3
+ ) t 
+group by feature;
+set mapred.reduce.tasks=-1;
+```
+
+## prediction
+```sql
+create or replace view e2006tfidf_arow_predict
+as
+select
+  t.rowid, 
+  sum(m.weight * t.value) as predicted
+from 
+  e2006tfidf_test_exploded t LEFT OUTER JOIN
+  e2006tfidf_arow_model m ON (t.feature = m.feature)
+group by
+  t.rowid;
+```
+
+## evaluation
+```sql
+drop table e2006tfidf_arow_submit;
+create table e2006tfidf_arow_submit as
+select 
+  t.target as actual, 
+  p.predicted as predicted
+from 
+  e2006tfidf_test t JOIN e2006tfidf_arow_predict p
+    on (t.rowid = p.rowid);
+
+select avg(actual), avg(predicted) from e2006tfidf_arow_submit;
+```
+> -3.8200363760415414     -3.8692518911517433
+
+```sql
+set hivevar:mean_actual=-3.8200363760415414;
+
+select 
+   sqrt(sum(pow(predicted - actual,2.0))/count(1)) as RMSE, 
+   sum(pow(predicted - actual,2.0))/count(1) as MSE, 
+   sum(abs(predicted - actual))/count(1) as MAE,
+   1 - sum(pow(actual - predicted,2.0)) / sum(pow(actual - ${mean_actual},2.0)) as R2
+from 
+   e2006tfidf_arow_submit;
+```
+> 0.37862513029019407     0.14335698928726642     0.2368787001269389      0.5041085155590119
+
+--- 
+#[AROWe]
+AROWe is a modified version of AROW that uses hinge loss (epsilon = 0.1).
+
+##Training
+```sql
+set mapred.reduce.tasks=64;
+drop table e2006tfidf_arowe_model ;
+create table e2006tfidf_arowe_model as
+select 
+ feature,
+ -- avg(weight) as weight -- [hivemall v0.1]
+ argmin_kld(weight, covar) as weight -- [hivemall v0.2 or later]
+from 
+ (select 
+     -- train_arowe_regr(addBias(features),target) as (feature,weight)    -- [hivemall v0.1]
+     train_arowe_regr(addBias(features),target) as (feature,weight,covar) -- [hivemall v0.2 or later]
+  from 
+     e2006tfidf_train_x3
+ ) t 
+group by feature;
+set mapred.reduce.tasks=-1;
+```
+
+## prediction
+```sql
+create or replace view e2006tfidf_arowe_predict
+as
+select
+  t.rowid, 
+  sum(m.weight * t.value) as predicted
+from 
+  e2006tfidf_test_exploded t LEFT OUTER JOIN
+  e2006tfidf_arowe_model m ON (t.feature = m.feature)
+group by
+  t.rowid;
+```
+
+## evaluation
+```sql
+drop table e2006tfidf_arowe_submit;
+create table e2006tfidf_arowe_submit as
+select 
+  t.target as actual, 
+  p.predicted as predicted
+from 
+  e2006tfidf_test t JOIN e2006tfidf_arowe_predict p
+    on (t.rowid = p.rowid);
+
+select avg(actual), avg(predicted) from e2006tfidf_arowe_submit;
+```
+> -3.8200363760415414     -3.86494905688414
+
+```sql
+set hivevar:mean_actual=-3.8200363760415414;
+
+select 
+   sqrt(sum(pow(predicted - actual,2.0))/count(1)) as RMSE, 
+   sum(pow(predicted - actual,2.0))/count(1) as MSE, 
+   sum(abs(predicted - actual))/count(1) as MAE,
+   1 - sum(pow(actual - predicted,2.0)) / sum(pow(actual - ${mean_actual},2.0)) as R2
+from 
+   e2006tfidf_arowe_submit;
+```
+> 0.37789148212861856     0.14280197226536404     0.2357339155291536      0.5060283955470721
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/regression/e2006_dataset.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/regression/e2006_dataset.md b/docs/gitbook/regression/e2006_dataset.md
new file mode 100644
index 0000000..329fb39
--- /dev/null
+++ b/docs/gitbook/regression/e2006_dataset.md
@@ -0,0 +1,72 @@
+http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-tfidf
+
+Prerequisite
+============
+* [hivemall.jar](https://github.com/myui/hivemall/tree/master/target/hivemall.jar)
+* [conv.awk](https://github.com/myui/hivemall/tree/master/scripts/misc/conv.awk)
+* [define-all.hive](https://github.com/myui/hivemall/tree/master/scripts/ddl/define-all.hive)
+
+Data preparation
+================
+
+```sh
+cd /mnt/archive/datasets/regression/E2006-tfidf
+awk -f conv.awk E2006.train > E2006.train.tsv
+awk -f conv.awk  E2006.test > E2006.test.tsv
+
+hadoop fs -mkdir -p /dataset/E2006-tfidf/train
+hadoop fs -mkdir -p /dataset/E2006-tfidf/test
+hadoop fs -put E2006.train.tsv /dataset/E2006-tfidf/train
+hadoop fs -put E2006.test.tsv /dataset/E2006-tfidf/test
+```
+
+```sql
+create database E2006;
+use E2006;
+
+delete jar /home/myui/tmp/hivemall.jar;
+add jar /home/myui/tmp/hivemall.jar;
+
+source /home/myui/tmp/define-all.hive;
+
+Create external table e2006tfidf_train (
+  rowid int,
+  target float,
+  features ARRAY<STRING>
+) 
+ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY "," 
+STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
+
+Create external table e2006tfidf_test (
+  rowid int, 
+  target float,
+  features ARRAY<STRING>
+) 
+ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY "," 
+STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/test';
+
+create table e2006tfidf_test_exploded as
+select 
+  rowid,
+  target,
+  split(feature,":")[0] as feature,
+  cast(split(feature,":")[1] as float) as value
+  -- hivemall v0.3.1 or later
+  -- extract_feature(feature) as feature,
+  -- extract_weight(feature) as value
+from 
+  e2006tfidf_test LATERAL VIEW explode(addBias(features)) t AS feature;
+```
+
+## Amplify training examples (global shuffle)
+```sql
+-- set mapred.reduce.tasks=32;
+set hivevar:seed=31;
+set hivevar:xtimes=3;
+create or replace view e2006tfidf_train_x3 as 
+select * from (
+select amplify(${xtimes}, *) as (rowid, target, features) from e2006tfidf_train
+) t
+CLUSTER BY rand(${seed});
+-- set mapred.reduce.tasks=-1;
+```
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/regression/kddcup12tr2.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/regression/kddcup12tr2.md b/docs/gitbook/regression/kddcup12tr2.md
new file mode 100644
index 0000000..e69de29

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/regression/kddcup12tr2_adagrad.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/regression/kddcup12tr2_adagrad.md b/docs/gitbook/regression/kddcup12tr2_adagrad.md
new file mode 100644
index 0000000..e6c8eb4
--- /dev/null
+++ b/docs/gitbook/regression/kddcup12tr2_adagrad.md
@@ -0,0 +1,109 @@
+_Note: AdaGrad/AdaDelta are supported in Hivemall v0.3b2 or later (or in the master branch)._
+
+# Preparation 
+```sql
+add jar ./tmp/hivemall-with-dependencies.jar;
+source ./tmp/define-all.hive;
+
+use kdd12track2;
+
+-- SET mapreduce.framework.name=yarn;
+-- SET hive.execution.engine=mr;
+-- SET mapreduce.framework.name=yarn-tez;
+-- SET hive.execution.engine=tez;
+SET mapred.reduce.tasks=32; -- [optional] set the explicit number of reducers to make group-by aggregation faster
+```
+
+# AdaGrad
+```sql
+drop table adagrad_model;
+create table adagrad_model 
+as
+select 
+ feature,
+ avg(weight) as weight
+from 
+ (select 
+     adagrad(features,label) as (feature,weight)
+  from 
+     training_orcfile
+ ) t 
+group by feature;
+
+drop table adagrad_predict;
+create table adagrad_predict
+  ROW FORMAT DELIMITED 
+    FIELDS TERMINATED BY "\t"
+    LINES TERMINATED BY "\n"
+  STORED AS TEXTFILE
+as
+select
+  t.rowid, 
+  sigmoid(sum(m.weight)) as prob
+from 
+  testing_exploded  t LEFT OUTER JOIN
+  adagrad_model m ON (t.feature = m.feature)
+group by 
+  t.rowid
+order by 
+  rowid ASC;
+```
+
+```sh
+hadoop fs -getmerge /user/hive/warehouse/kdd12track2.db/adagrad_predict adagrad_predict.tbl
+
+gawk -F "\t" '{print $2;}' adagrad_predict.tbl > adagrad_predict.submit
+
+pypy scoreKDD.py KDD_Track2_solution.csv adagrad_predict.submit
+```
+>AUC(SGD) : 0.739351
+
+>AUC(ADAGRAD) : 0.743279
+
+# AdaDelta
+```sql
+drop table adadelta_model;
+create table adadelta_model 
+as
+select 
+ feature,
+ cast(avg(weight) as float) as weight
+from 
+ (select 
+     adadelta(features,label) as (feature,weight)
+  from 
+     training_orcfile
+ ) t 
+group by feature;
+
+drop table adadelta_predict;
+create table adadelta_predict
+  ROW FORMAT DELIMITED 
+    FIELDS TERMINATED BY "\t"
+    LINES TERMINATED BY "\n"
+  STORED AS TEXTFILE
+as
+select
+  t.rowid, 
+  sigmoid(sum(m.weight)) as prob
+from 
+  testing_exploded  t LEFT OUTER JOIN
+  adadelta_model m ON (t.feature = m.feature)
+group by 
+  t.rowid
+order by 
+  rowid ASC;
+```
+
+```sh
+hadoop fs -getmerge /user/hive/warehouse/kdd12track2.db/adadelta_predict adadelta_predict.tbl
+
+gawk -F "\t" '{print $2;}' adadelta_predict.tbl > adadelta_predict.submit
+
+pypy scoreKDD.py KDD_Track2_solution.csv adadelta_predict.submit
+```
+>AUC(SGD) : 0.739351
+
+>AUC(ADAGRAD) : 0.743279
+
+> AUC(AdaDelta) : 0.746878
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/regression/kddcup12tr2_dataset.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/regression/kddcup12tr2_dataset.md b/docs/gitbook/regression/kddcup12tr2_dataset.md
new file mode 100644
index 0000000..8713e99
--- /dev/null
+++ b/docs/gitbook/regression/kddcup12tr2_dataset.md
@@ -0,0 +1,227 @@
+The task is predicting the click-through rate (CTR) of advertisements, i.e., the probability of each ad being clicked.
+http://www.kddcup2012.org/c/kddcup2012-track2
+
+---
+
+**Dataset**  
+
+|    File    |    Size     |    Records   |
+|:-----------|------------:|:------------:|
+| KDD_Track2_solution.csv | 244MB | 20,297,595 (20,297,594 w/o header) |
+| descriptionid_tokensid.txt | 268MB | 3,171,830 |
+| purchasedkeywordid_tokensid.txt | 26MB | 1,249,785 |
+| queryid_tokensid.txt | 704MB | 26,243,606 |
+| test.txt | 1.3GB | 20,297,594 |
+| titleid_tokensid.txt | 171MB | 4,051,441 |
+| training.txt | 9.9GB | 149,639,105 |
+| userid_profile.txt | 283MB | 23,669,283 |
+
+![tables](https://raw.github.com/myui/hivemall/master/resources/examples/kddtrack2/tables.png)
+
+_Tokens are actually not used in this example. Try using them on your own._
+
+---
+
+```sql
+create database kdd12track2;
+use kdd12track2;
+
+delete jar /tmp/hivemall.jar;
+add jar /tmp/hivemall.jar;
+source /tmp/define-all.hive;
+
+Create external table training (
+  RowID BIGINT,
+  Clicks INT, 
+  Impression INT, 
+  DisplayURL STRING, 
+  AdID INT,
+  AdvertiserID INT, 
+  Depth SMALLINT, 
+  Position SMALLINT, 
+  QueryID INT, 
+  KeywordID INT,
+  TitleID INT, 
+  DescriptionID INT, 
+  UserID INT
+) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/kddcup2012/track2/training';
+
+Create external table testing (
+  RowID BIGINT,
+  DisplayURL STRING, 
+  AdID INT,
+  AdvertiserID INT, 
+  Depth SMALLINT, 
+  Position SMALLINT, 
+  QueryID INT, 
+  KeywordID INT,
+  TitleID INT, 
+  DescriptionID INT, 
+  UserID INT
+) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/kddcup2012/track2/testing';
+
+Create external table user (
+  UserID INT, 
+  Gender TINYINT,
+  Age TINYINT
+) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/kddcup2012/track2/user';
+
+Create external table query (
+  QueryID INT,
+  Tokens STRING
+) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/kddcup2012/track2/query';
+
+Create external table keyword (
+  KeywordID INT,
+  Tokens STRING
+) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/kddcup2012/track2/keyword';
+
+Create external table title (
+  TitleID INT, 
+  Tokens STRING
+) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/kddcup2012/track2/title';
+
+Create external table description (
+  DescriptionID INT,
+  Tokens STRING
+) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/kddcup2012/track2/description';
+
+Create external table solution (
+   RowID BIGINT,
+   Clicks INT,
+   Impressions INT,
+   Private BOOLEAN 
+) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/kddcup2012/track2/solution';
+```
+
+```sh
+gawk '{print NR"\t"$0;}' training.txt | \
+hadoop fs -put - /kddcup2012/track2/training/training.tsv
+
+gawk '{print NR"\t"$0;}' test.txt | \
+hadoop fs -put - /kddcup2012/track2/testing/test.tsv
+
+hadoop fs -put userid_profile.txt /kddcup2012/track2/user/user.tsv
+
+tail -n +2 KDD_Track2_solution.csv | sed -e 's/Public/FALSE/g' | sed -e 's/Private/TRUE/g' | gawk '{print NR","$0;}' | \
+hadoop fs -put - /kddcup2012/track2/solution/solution.csv
+
+hadoop fs -put  queryid_tokensid.txt /kddcup2012/track2/query/tokensid.tsv
+hadoop fs -put purchasedkeywordid_tokensid.txt /kddcup2012/track2/keyword/tokensid.tsv
+hadoop fs -put titleid_tokensid.txt /kddcup2012/track2/title/tokensid.tsv
+hadoop fs -put descriptionid_tokensid.txt /kddcup2012/track2/description/tokensid.tsv
+```
+
+## Converting feature representation by feature hashing
+http://en.wikipedia.org/wiki/Feature_hashing
+
+*mhash* applies the MurmurHash3 function to convert each feature string into a hashed integer feature.
+
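+A minimal illustration of the idea (a sketch; the specific columns picked here are just examples from the `training` table): mhash() maps each prefixed feature string to an int feature index, so distinct raw features are folded into a fixed-size hashed feature space.
+
+```sql
+-- Illustrative only: the concrete int values returned depend on the hash function
+-- and the default feature dimension used by mhash().
+select
+  mhash(concat("2:", adid))    as hashed_adid,
+  mhash(concat("10:", userid)) as hashed_userid
+from training
+limit 2;
+```
+
+The views below apply the same idea to every categorical column of the training and testing tables:
+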
+```sql
+create or replace view training2 as
+select
+  rowid,
+  clicks,
+  (impression - clicks) as noclick,
+  mhash(concat("1:", displayurl)) as displayurl, 
+  mhash(concat("2:", adid)) as adid, 
+  mhash(concat("3:", advertiserid)) as advertiserid, 
+  mhash(concat("4:", depth)) as depth, 
+  mhash(concat("5:", position)) as position, 
+  mhash(concat("6:", queryid)) as queryid, 
+  mhash(concat("7:", keywordid)) as keywordid, 
+  mhash(concat("8:", titleid)) as titleid, 
+  mhash(concat("9:", descriptionid)) as descriptionid, 
+  mhash(concat("10:", userid)) as userid, 
+  mhash(concat("11:", COALESCE(gender,"0"))) as gender, 
+  mhash(concat("12:", COALESCE(age,"-1"))) as age, 
+  -1 as bias
+from (
+select
+  t.*,
+  u.gender,
+  u.age
+from 
+  training t 
+  LEFT OUTER JOIN user u 
+    on t.userid = u.userid
+) t;
+
+create or replace view testing2 as
+select
+  rowid, 
+  array(displayurl, adid, advertiserid, depth, position, queryid, keywordid, titleid, descriptionid, userid, gender, age, bias) 
+    as features
+from (
+select
+  rowid,
+  mhash(concat("1:", displayurl)) as displayurl, 
+  mhash(concat("2:", adid)) as adid, 
+  mhash(concat("3:", advertiserid)) as advertiserid, 
+  mhash(concat("4:", depth)) as depth, 
+  mhash(concat("5:", position)) as position, 
+  mhash(concat("6:", queryid)) as queryid, 
+  mhash(concat("7:", keywordid)) as keywordid, 
+  mhash(concat("8:", titleid)) as titleid, 
+  mhash(concat("9:", descriptionid)) as descriptionid, 
+  mhash(concat("10:", userid)) as userid, 
+  mhash(concat("11:", COALESCE(gender,"0"))) as gender, 
+  mhash(concat("12:", COALESCE(age,"-1"))) as age, 
+  -1 as bias
+from (
+select
+  t.*,
+  u.gender,
+  u.age
+from 
+  testing t 
+  LEFT OUTER JOIN user u 
+    on t.userid = u.userid
+) t1
+) t2;
+```
+
+## Compressing large training tables
+```sql
+create table training_orcfile (
+ rowid bigint,
+ label float,
+ features array<int>
+) STORED AS orc tblproperties ("orc.compress"="SNAPPY");
+```
+_Caution: Joining the training table with the user table takes a long time. Consider not using gender and age (and thus avoiding the join) if your Hadoop cluster is small._
+
+[kddconv.awk](https://github.com/myui/hivemall/blob/master/resources/examples/kddtrack2/kddconv.awk)
+
+```sql
+add file /tmp/kddconv.awk;
+
+-- SET mapred.reduce.tasks=64;
+-- SET hive.auto.convert.join=false;
+
+INSERT OVERWRITE TABLE training_orcfile 
+select transform(*) 
+  ROW FORMAT DELIMITED
+     FIELDS TERMINATED BY "\t"
+     LINES TERMINATED BY "\n"
+using 'gawk -f kddconv.awk' 
+  as (rowid BIGINT, label FLOAT, features ARRAY<INT>)
+  ROW FORMAT DELIMITED
+     FIELDS TERMINATED BY "\t"
+     COLLECTION ITEMS TERMINATED BY ","
+     LINES TERMINATED BY "\n"
+from training2
+CLUSTER BY rand();
+
+-- SET mapred.reduce.tasks=-1;
+-- SET hive.auto.convert.join=true;
+
+create table testing_exploded as
+select 
+  rowid,
+  feature
+from 
+  testing2 
+  LATERAL VIEW explode(features) t AS feature;
+```
+_Caution: We recommend setting "mapred.reduce.tasks" in the above example so that the training_orcfile table is partitioned into multiple pieces, as sketched below._
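+
+A minimal sketch of that recommendation (the reducer count of 64 is only an example; choose a value that fits your cluster):
+
+```sql
+-- Set an explicit number of reducers before the INSERT so that training_orcfile
+-- is written as multiple files, then reset to the default afterwards.
+SET mapred.reduce.tasks=64;
+-- (re-run the INSERT OVERWRITE TABLE training_orcfile query shown above)
+SET mapred.reduce.tasks=-1;
+```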
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/regression/kddcup12tr2_lr.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/regression/kddcup12tr2_lr.md b/docs/gitbook/regression/kddcup12tr2_lr.md
new file mode 100644
index 0000000..0ff9b97
--- /dev/null
+++ b/docs/gitbook/regression/kddcup12tr2_lr.md
@@ -0,0 +1,141 @@
+The task is predicting the click-through rate (CTR) of advertisements, i.e., the probability of each ad being clicked.  
+http://www.kddcup2012.org/c/kddcup2012-track2
+
+_Caution: This example just shows a baseline result. Use token tables and amplifier to get better AUC score._
+
+---
+Logistic Regression
+===============
+
+## Training
+```sql
+use kdd12track2;
+
+-- set mapred.max.split.size=134217728; -- [optional] set if OOM caused at mappers on training
+-- SET mapred.max.split.size=67108864;
+select count(1) from training_orcfile;
+```
+> 235582879
+
+235582879 records / 56 mappers ≈ 4,206,837 records per mapper; the "-total_steps" value below (5,000,000) is set slightly above this per-mapper record count.
+
+```sql
+set hivevar:total_steps=5000000;
+-- set mapred.reduce.tasks=64; -- [optional] set the explicit number of reducers to make group-by aggregation faster
+
+drop table lr_model;
+create table lr_model 
+as
+select 
+ feature,
+ cast(avg(weight) as float) as weight
+from 
+ (select 
+     logress(features, label, "-total_steps ${total_steps}") as (feature,weight)
+     -- logress(features, label) as (feature,weight)
+  from 
+     training_orcfile
+ ) t 
+group by feature;
+
+-- set mapred.max.split.size=-1; -- reset to the default value
+```
+_Note: Setting the "-total_steps" option is optional._
+
+## Prediction
+```sql
+drop table lr_predict;
+create table lr_predict
+  ROW FORMAT DELIMITED 
+    FIELDS TERMINATED BY "\t"
+    LINES TERMINATED BY "\n"
+  STORED AS TEXTFILE
+as
+select
+  t.rowid, 
+  sigmoid(sum(m.weight)) as prob
+from 
+  testing_exploded  t LEFT OUTER JOIN
+  lr_model m ON (t.feature = m.feature)
+group by 
+  t.rowid
+order by 
+  rowid ASC;
+```
+## Evaluation
+
+[scoreKDD.py](https://github.com/myui/hivemall/blob/master/resources/examples/kddtrack2/scoreKDD.py)
+
+```sh
+hadoop fs -getmerge /user/hive/warehouse/kdd12track2.db/lr_predict lr_predict.tbl
+
+gawk -F "\t" '{print $2;}' lr_predict.tbl > lr_predict.submit
+
+pypy scoreKDD.py KDD_Track2_solution.csv  lr_predict.submit
+```
+_Note: You can use python instead of pypy._
+
+| Measure | Score |
+|:-----------|------------:|
+| AUC  | 0.741111 |
+| NWMAE | 0.045493 |
+| WRMSE | 0.142395 |
+
+---
+Passive Aggressive
+===============
+
+## Training
+```sql
+drop table pa_model;
+create table pa_model 
+as
+select 
+ feature,
+ cast(avg(weight) as float) as weight
+from 
+ (select 
+     train_pa1a_regr(features,label) as (feature,weight)
+  from 
+     training_orcfile
+ ) t 
+group by feature;
+```
+_PA1a is recommended when using PA for regression._
+
+## Prediction
+```sql
+drop table pa_predict;
+create table pa_predict
+  ROW FORMAT DELIMITED 
+    FIELDS TERMINATED BY "\t"
+    LINES TERMINATED BY "\n"
+  STORED AS TEXTFILE
+as
+select
+  t.rowid, 
+  sum(m.weight) as prob
+from 
+  testing_exploded  t LEFT OUTER JOIN
+  pa_model m ON (t.feature = m.feature)
+group by 
+  t.rowid
+order by 
+  rowid ASC;
+```
+_The "prob" of PA can be used only for ranking and can have a negative value. A higher weight means much likely to be clicked. Note that AUC is sort a measure for evaluating ranking accuracy._
+
+## Evaluation
+
+```sh
+hadoop fs -getmerge /user/hive/warehouse/kdd12track2.db/pa_predict pa_predict.tbl
+
+gawk -F "\t" '{print $2;}' pa_predict.tbl > pa_predict.submit
+
+pypy scoreKDD.py KDD_Track2_solution.csv  pa_predict.submit
+```
+
+| Measure | Score |
+|:-----------|------------:|
+| AUC  | 0.739722 |
+| NWMAE | 0.049582 |
+| WRMSE | 0.143698 |
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/regression/kddcup12tr2_lr_amplify.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/regression/kddcup12tr2_lr_amplify.md b/docs/gitbook/regression/kddcup12tr2_lr_amplify.md
new file mode 100644
index 0000000..4df124e
--- /dev/null
+++ b/docs/gitbook/regression/kddcup12tr2_lr_amplify.md
@@ -0,0 +1,103 @@
+This article explains the *amplify* technique, which is useful for improving prediction scores.
+
+Iterations are mandatory in machine learning (e.g., in [stochastic gradient descent](http://en.wikipedia.org/wiki/Stochastic_gradient_descent)) to get good prediction models. However, MapReduce is known to be ill-suited for iterative algorithms because the input/output of each MapReduce job goes through HDFS.
+
+In this example, we show how Hivemall deals with this problem. We use [KDD Cup 2012, Track 2 Task](https://github.com/myui/hivemall/wiki/KDDCup-2012-track-2-CTR-prediction-dataset) as an example.
+
+**WARNING**: rand_amplify() is supported in v0.2-beta1 and later.
+
+---
+# Amplify training examples in Map phase and shuffle them in Reduce phase
+Hivemall provides the **amplify** UDTF to enumerate iteration effects in machine learning without several MapReduce steps. 
+
+The amplify function returns multiple rows for each row.
+The first argument ${xtimes} is the multiplication factor.  
+In the following examples, the multiplication factor is set to 3.
+
+```sql
+set hivevar:xtimes=3;
+
+create or replace view training_x3
+as
+select 
+  * 
+from (
+select
+   amplify(${xtimes}, *) as (rowid, label, features)
+from  
+   training_orcfile
+) t
+CLUSTER BY rand();
+```
+
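+As a quick sanity check (a sketch), the amplified view should contain ${xtimes} times as many rows as the original table:
+
+```sql
+-- amplify() emits each input row ${xtimes} (=3) times, so the row count triples.
+-- Note: counting the view runs the amplification job, so do this only when needed.
+select count(1) from training_orcfile;
+select count(1) from training_x3;   -- expected: 3x the count above
+```
+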
+In the above example, the [CLUSTER BY](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy#LanguageManualSortBy-SyntaxofClusterByandDistributeBy) clause distributes Map outputs to reducers using a random key as the distribution key, so the input records of each reducer are randomly shuffled.
+
+The multiplication of records and the random shuffling have an effect similar to iterations.
+So, we recommend using the amplified view for training as follows:
+
+```sql
+create table lr_model_x3 
+as
+select 
+ feature,
+ cast(avg(weight) as float) as weight
+from 
+ (select 
+     logress(features,label) as (feature,weight)
+  from 
+     training_x3
+ ) t 
+group by feature;
+```
+
+The above query is executed by 2 MapReduce jobs as shown below:
+![amplifier](https://dl.dropboxusercontent.com/u/13123103/hivemall/amplify.png)
+[Here](https://dl.dropboxusercontent.com/u/13123103/hivemall/amplify_plan.txt) is the actual plan generated by Hive.
+
+Using *training_x3* instead of the plain training table results in a better AUC (0.746214) in [this](https://github.com/myui/hivemall/wiki/KDDCup-2012-track-2-CTR-prediction-(regression\)) example.
+
+A problem with amplify() is that the shuffle (copy) and merge phase of stage 1 can become a bottleneck.
+When the training table is large enough to involve 100 Map tasks, the merge operator needs to merge at least 100 files by (external) merge sort!
+
+Note that the actual bottleneck is not the M/R iterations but shuffling the training instances. Iterating without shuffling (as in [the Spark example](http://spark.incubator.apache.org/examples.html)) causes very slow convergence and requires more iterations. Shuffling cannot be avoided even in iterative MapReduce variants.
+
+![amplify elapsed](https://dl.dropboxusercontent.com/u/13123103/hivemall/amplify_elapsed.png)
+
+---
+# Amplify and shuffle training examples in each Map task
+
+To deal with large training data, Hivemall provides the **rand_amplify** UDTF, which randomly shuffles input rows within each Map task.
+The rand_amplify UDTF outputs rows in a random order whenever the local buffer, whose size is specified by ${shufflebuffersize}, is filled.
+
+With rand_amplify(), the view definition of training_x3 becomes as follows:
+```sql
+set hivevar:shufflebuffersize=1000;
+
+create or replace view training_x3
+as
+select
+   rand_amplify(${xtimes}, ${shufflebuffersize}, *) as (rowid, label, features)
+from  
+   training_orcfile;
+```
+
+The training query is executed as follows:
+![Random amplify](https://dl.dropboxusercontent.com/u/13123103/hivemall/randamplify.png)  
+[Here](https://dl.dropboxusercontent.com/u/13123103/hivemall/randamplify_plan.txt) is the actual query plan.
+
+The map-local multiplication and shuffling have no bottleneck in the merge phase, and the query is efficiently executed within a single MapReduce job.
+
+![rand_amplify elapsed ](https://dl.dropboxusercontent.com/u/13123103/hivemall/randamplify_elapsed.png)
+
+Using *rand_amplify* results in a better AUC (0.743392) in [this](https://github.com/myui/hivemall/wiki/KDDCup-2012-track-2-CTR-prediction-(regression\)) example.
+
+---
+# Conclusion
+
+We recommend using *amplify()* for small training inputs and *rand_amplify()* for large training inputs to get better accuracy in a reasonable training time.
+
+| Method     | ELAPSED TIME (sec) | AUC |
+|:-----------|--------------------|----:|
+| Plain | 89.718 | 0.734805 |
+| amplifier+clustered by | 479.855  | 0.746214 |
+| rand_amplifier | 116.424 | 0.743392 |
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/resources/images/hivemall-logo-color-small.png
----------------------------------------------------------------------
diff --git a/docs/gitbook/resources/images/hivemall-logo-color-small.png b/docs/gitbook/resources/images/hivemall-logo-color-small.png
new file mode 100644
index 0000000..1194236
Binary files /dev/null and b/docs/gitbook/resources/images/hivemall-logo-color-small.png differ

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/resources/images/techstack.png
----------------------------------------------------------------------
diff --git a/docs/gitbook/resources/images/techstack.png b/docs/gitbook/resources/images/techstack.png
new file mode 100644
index 0000000..fb29a3d
Binary files /dev/null and b/docs/gitbook/resources/images/techstack.png differ

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/spark/.gitkeep
----------------------------------------------------------------------
diff --git a/docs/gitbook/spark/.gitkeep b/docs/gitbook/spark/.gitkeep
new file mode 100644
index 0000000..e69de29

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/tips/README.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/tips/README.md b/docs/gitbook/tips/README.md
new file mode 100644
index 0000000..e69de29

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/tips/addbias.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/tips/addbias.md b/docs/gitbook/tips/addbias.md
new file mode 100644
index 0000000..2b11d51
--- /dev/null
+++ b/docs/gitbook/tips/addbias.md
@@ -0,0 +1,43 @@
+A trainer learns a function f(x)=y, i.e., the weights _W_, of the following form to predict a label y from a feature vector x:
+_y=f(x)=Wx_
+
+Without a bias clause (or regularization), f(x) cannot produce a hyperplane that separates (1,1) and (2,2) because f(x) always crosses the origin (0,0).
+
+With a bias clause b, a trainer learns the following f(x):
+_f(x)=Wx+b_ 
+The learned model then accounts for the bias in the dataset, so the predicted hyperplane does not have to cross the origin.
+
+**addBias()** of Hivemall adds a bias to a feature vector. 
+To enable a bias clause, use addBias() for **both** _(important!)_ training and test data as follows.
+The bias _b_ is the feature "0" ("-1" before v0.3) by default. See [AddBiasUDF](https://github.com/myui/hivemall/blob/master/src/main/hivemall/ftvec/AddBiasUDF.java) for details.
+
+Note that the bias is expressed as a feature that appears in all training/testing examples.
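+
+A small illustration (a sketch; the exact bias feature index and value depend on the Hivemall version — "0:1.0" is assumed here for v0.3 or later):
+
+```sql
+-- addBias() appends the bias feature to the given feature vector.
+select addBias(array("1:0.5", "3:1.2"))
+from e2006tfidf_test limit 1;
+-- expected output (assumed): ["1:0.5","3:1.2","0:1.0"]
+```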
+
+# Adding a bias clause to test data
+```sql
+create table e2006tfidf_test_exploded as
+select 
+  rowid,
+  target,
+  split(feature,":")[0] as feature,
+  cast(split(feature,":")[1] as float) as value
+  -- extract_feature(feature) as feature, -- hivemall v0.3.1 or later
+  -- extract_weight(feature) as value     -- hivemall v0.3.1 or later
+from 
+  e2006tfidf_test LATERAL VIEW explode(addBias(features)) t AS feature;
+```
+
+# Adding a bias clause to training data
+```sql
+create table e2006tfidf_pa1a_model as
+select 
+ feature,
+ avg(weight) as weight
+from 
+ (select 
+     pa1a_regress(addBias(features),target) as (feature,weight)
+  from 
+     e2006tfidf_train_x3
+ ) t 
+group by feature;
+```
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/tips/emr.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/tips/emr.md b/docs/gitbook/tips/emr.md
new file mode 100644
index 0000000..4521635
--- /dev/null
+++ b/docs/gitbook/tips/emr.md
@@ -0,0 +1,182 @@
+## Prerequisite
+Learn how to use Hive with Elastic MapReduce (EMR).  
+http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hive.html
+
+Before launching an EMR job, 
+* create ${s3bucket}/emr/outputs for outputs
+* optionally, create ${s3bucket}/emr/logs for logging
+* put [emr_hivemall_bootstrap.sh](https://raw.github.com/myui/hivemall/master/scripts/misc/emr_hivemall_bootstrap.sh) on ${s3bucket}/emr/conf
+
+Then, launch an EMR job with Hive in interactive mode.
+I usually launch EMR clusters with cheap Spot Instances through the [CLI client](http://aws.amazon.com/developertools/2264) as follows:
+```sh
+./elastic-mapreduce --create --alive \
+ --name "Hive cluster" \
+ --hive-interactive --hive-versions latest \
+ --hive-site=s3://${s3bucket}/emr/conf/hive-site.xml \
+ --ami-version latest \
+ --instance-group master --instance-type m1.medium --instance-count 1 --bid-price 0.175 \
+ --instance-group core --instance-type m1.large --instance-count 3 --bid-price 0.35 \
+ --enable-debugging --log-uri s3n://${s3bucket}/emr/logs \
+ --bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if \
+   --args "instance.isMaster=true,s3://${s3bucket}/emr/conf/emr_hivemall_bootstrap.sh" --bootstrap-name "hivemall setup"
+ --bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia --bootstrap-name "install ganglia"
+```
+_To use YARN instead of old Hadoop, specify "[--ami-version 3.0.0](http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-ami.html#ami-versions-supported)". Hivemall works on both old Hadoop and YARN._
+
+Or, launch an interactive EMR job using the EMR GUI wizard.
+![emr-wizard](https://dl.dropboxusercontent.com/u/13123103/emr-wizard.png)
+![emr-bootstrap](https://dl.dropboxusercontent.com/u/13123103/emr-bootstrap.png)
+
+## Data preparation
+
+Put training and test data in a TSV format on Amazon S3, e.g., on ${s3bucket}/datasets/news20b/[train|test].
+
+```sql
+create database news20;
+use news20;
+
+add jar ./tmp/hivemall.jar;
+source ./tmp/define-all.hive;
+
+set hivevar:s3bucket=YOUR_BUCKET_NAME;
+
+-- The default input split size is often too large for Hivemall
+set mapred.max.split.size=67108864;
+
+Create external table news20b_train (
+  rowid int,
+  label int,
+  features ARRAY<STRING>
+) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY "," 
+STORED AS TEXTFILE LOCATION 's3n://${s3bucket}/datasets/news20b/train';
+
+Create external table news20b_test (
+  rowid int, 
+  label int,
+  features ARRAY<STRING>
+) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY ","
+STORED AS TEXTFILE LOCATION 's3n://${s3bucket}/datasets/news20b/test';
+
+-- create or replace view news20b_train_x3
+-- as
+-- select 
+--  * 
+-- from (
+-- select
+--   amplify(3, *) as (rowid, label, features)
+-- from  
+--   news20b_train 
+-- ) t
+-- CLUSTER BY CAST(rand(47) * 100 as INT), CAST(rand(49) * 100 as INT), CAST(rand(50) * 100 as INT);
+
+create or replace view news20b_train_x3
+as
+select
+   rand_amplify(3, 1000, *) as (rowid, label, features)
+from  
+   news20b_train;
+
+create table news20b_test_exploded as
+select 
+  rowid,
+  label,
+  cast(split(feature,":")[0] as int) as feature,
+  cast(split(feature,":")[1] as float) as value
+from 
+  news20b_test LATERAL VIEW explode(addBias(features)) t AS feature;
+```
+
+---
+# Adaptive Regularization of Weight Vectors (AROW)
+
+## training
+```sql
+DROP TABLE news20b_arow_model1;
+CREATE EXTERNAL TABLE IF NOT EXISTS news20b_arow_model1 (
+  feature string,
+  weight float
+)
+ROW FORMAT DELIMITED 
+  FIELDS TERMINATED BY '\t'
+  LINES TERMINATED BY '\n'
+STORED AS TEXTFILE
+LOCATION 's3://${s3bucket}/emr/outputs/news20b_arow_model1';
+
+insert overwrite table news20b_arow_model1
+select 
+ feature,
+ cast(voted_avg(weight) as float) as weight
+from 
+ (select 
+     train_arow(addBias(features),label) as (feature,weight)
+  from 
+     news20b_train_x3
+ ) t 
+group by feature;
+```
+
+## prediction
+```sql
+create or replace view news20b_arow_predict1 
+as
+select
+  t.rowid, 
+  sum(m.weight * t.value) as total_weight,
+  case when sum(m.weight * t.value) > 0.0 then 1 else -1 end as label
+from 
+  news20b_test_exploded t LEFT OUTER JOIN
+  news20b_arow_model1 m ON (t.feature = m.feature)
+group by
+  t.rowid;
+```
+
+## evaluation
+```sql
+create or replace view news20b_arow_submit1 as
+select 
+  t.rowid, 
+  t.label as actual, 
+  pd.label as predicted
+from 
+  news20b_test t JOIN news20b_arow_predict1 pd 
+    on (t.rowid = pd.rowid);
+```
+
+```sql
+select count(1)/4996 from news20b_arow_submit1 
+where actual == predicted;
+```
+> 0.9659727782225781
+
+## Cleaning
+
+```sql
+drop table news20b_arow_model1;
+drop view news20b_arow_predict1;
+drop view news20b_arow_submit1;
+```
+
+---
+## Tips
+
+We recommend using m1.xlarge instances when running Hivemall on EMR, as follows.
+```sh
+./elastic-mapreduce --create --alive \
+ --name "Hive cluster" \
+ --hive-interactive --hive-versions latest \
+ --ami-version latest \
+ --instance-group master --instance-type m1.xlarge --instance-count 1 \
+ --instance-group core --instance-type m1.xlarge --instance-count 8 --bid-price 0.7 \
+ --instance-group task --instance-type m1.xlarge --instance-count 2 --bid-price 0.7 \
+ --enable-debugging --log-uri s3://mybucket/emr/logs \
+ --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
+   --args "-m,mapred.child.java.opts=-Xmx1536m,-m,mapred.tasktracker.map.tasks.maximum=7,-m,mapred.tasktracker.reduce.tasks.maximum=2,-c,fs.s3n.multipart.uploads.enable=true,-c,fs.s3n.multipart.uploads.split.size=67108864" \
+ --bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if \
+   --args "instance.isMaster=true,s3://mybucket/emr/conf/emr_hivemall_bootstrap.sh" \
+   --bootstrap-name "hivemall setup" \
+ --bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia \
+   --bootstrap-name "install ganglia" \
+ --availability-zone ap-northeast-1a
+```
+Using Spot Instances for the core/task instance groups is a good way to save money.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/tips/ensemble_learning.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/tips/ensemble_learning.md b/docs/gitbook/tips/ensemble_learning.md
new file mode 100644
index 0000000..6daaf1a
--- /dev/null
+++ b/docs/gitbook/tips/ensemble_learning.md
@@ -0,0 +1,180 @@
+This example explains how to run ensemble learning in Hivemall.   
+Are two heads better than one? Let's verify it with ensemble learning.
+
+<!-- toc -->
+
+---
+
+## UDF preparation
+```sql
+delete jar /home/myui/tmp/hivemall.jar;
+add jar /home/myui/tmp/hivemall.jar;
+
+source /home/myui/tmp/define-all.hive;
+```
+
+# [Case1] Model ensemble/mixing
+
+## training
+```sql
+SET hive.exec.parallel=true;
+SET hive.exec.parallel.thread.number=8;
+SET mapred.reduce.tasks=4;
+
+drop table news20mc_ensemble_model1;
+create table news20mc_ensemble_model1 as
+select 
+ label, 
+ -- cast(feature as int) as feature, -- hivemall v0.1
+ argmin_kld(feature, covar) as feature, -- hivemall v0.2 or later
+ voted_avg(weight) as weight
+from 
+ (select 
+     -- train_multiclass_cw(addBias(features),label) as (label,feature,weight)      -- hivemall v0.1
+     train_multiclass_cw(addBias(features),label) as (label,feature,weight,covar)   -- hivemall v0.2 or later
+  from 
+     news20mc_train_x3
+  union all
+  select 
+     -- train_multiclass_arow(addBias(features),label) as (label,feature,weight)    -- hivemall v0.1
+     train_multiclass_arow(addBias(features),label) as (label,feature,weight,covar) -- hivemall v0.2 or later
+  from 
+     news20mc_train_x3
+  union all
+  select 
+     -- train_multiclass_scw(addBias(features),label) as (label,feature,weight)     -- hivemall v0.1
+     train_multiclass_scw(addBias(features),label) as (label,feature,weight,covar)  -- hivemall v0.2 or later
+  from 
+     news20mc_train_x3
+ ) t 
+group by label, feature;
+
+-- reset to the default
+SET hive.exec.parallel=false;
+SET mapred.reduce.tasks=-1;
+```
+
+## prediction
+```sql
+create or replace view news20mc_ensemble_predict1 
+as
+select 
+  rowid, 
+  m.col0 as score, 
+  m.col1 as label
+from (
+select
+   rowid, 
+   maxrow(score, label) as m
+from (
+  select
+    t.rowid,
+    m.label,
+    sum(m.weight * t.value) as score
+  from 
+    news20mc_test_exploded t LEFT OUTER JOIN
+    news20mc_ensemble_model1 m ON (t.feature = m.feature)
+  group by
+    t.rowid, m.label
+) t1
+group by rowid
+) t2;
+```
+
+## evaluation
+```sql
+create or replace view news20mc_ensemble_submit1 as
+select 
+  t.label as actual, 
+  pd.label as predicted
+from 
+  news20mc_test t JOIN news20mc_ensemble_predict1 pd 
+    on (t.rowid = pd.rowid);
+```
+
+```sql
+select count(1)/3993 from news20mc_ensemble_submit1 
+where actual == predicted;
+```
+
+> 0.8494866015527173
+
+## Cleaning
+
+```sql
+drop table news20mc_ensemble_model1;
+drop view news20mc_ensemble_predict1;
+drop view news20mc_ensemble_submit1;
+```
+---
+
+Unfortunately, too many cooks spoil the broth in this case :-(
+
+| Algorithm | Accuracy |
+|:-----------|------------:|
+| AROW | 0.8474830954169797 |
+| SCW2 |  0.8482344102178813 |
+| Ensemble(model) | 0.8494866015527173 |
+| CW |  0.850488354620586 |
+
+
+---
+
+# [Case2] Prediction ensemble
+
+## prediction
+```sql
+create or replace view news20mc_pred_ensemble_predict1 
+as
+select 
+  rowid, 
+  m.col1 as label
+from (
+  select
+    rowid, 
+    maxrow(cnt, label) as m
+  from (
+    select
+      rowid,
+      label,
+      count(1) as cnt
+    from (
+      select * from news20mc_arow_predict1
+      union all
+      select * from news20mc_scw2_predict1
+      union all
+      select * from news20mc_cw_predict1
+    ) t1
+    group by rowid, label
+  ) t2
+  group by rowid
+) t3;
+```
+
+## evaluation
+```sql
+create or replace view news20mc_pred_ensemble_submit1 as
+select 
+  t.label as actual, 
+  pd.label as predicted
+from 
+  news20mc_test t JOIN news20mc_pred_ensemble_predict1 pd 
+    on (t.rowid = pd.rowid);
+```
+
+```sql
+select count(1)/3993 from news20mc_pred_ensemble_submit1 
+where actual == predicted;
+```
+
+> 0.8499874780866516
+
+Unfortunately, too many cooks spoil the broth in this case too :-(
+
+| Algorithm | Accuracy |
+|:-----------|------------:|
+| AROW | 0.8474830954169797 |
+| SCW2 |  0.8482344102178813 |
+| Ensemble(model) | 0.8494866015527173 |
+| Ensemble(prediction) | 0.8499874780866516 |
+| CW |  0.850488354620586 |
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/tips/general_tips.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/tips/general_tips.md b/docs/gitbook/tips/general_tips.md
new file mode 100644
index 0000000..e69de29