Posted to issues@hivemall.apache.org by takuti <gi...@git.apache.org> on 2018/05/17 06:13:15 UTC
[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...
Github user takuti commented on the issue:
https://github.com/apache/incubator-hivemall/pull/149
Evaluation has been conducted at: [takuti/criteo-ffm](https://github.com/takuti/criteo-ffm). See the repository for details.
As an example, I have used the tiny dataset provided at [guestwalk/kaggle-2014-criteo](https://github.com/guestwalk/kaggle-2014-criteo), which is already preprocessed and converted into the LIBFFM format:
- Split the 2,000 samples in `train.tiny.csv` into:
    - 1,587 training samples (`tr.sp`)
    - 412 validation samples (`va.sp`)
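A train/validation split of a LIBFFM-format file like the one above can be sketched in Python as follows (file names, ratio, and seed are illustrative assumptions, not the exact procedure used in the evaluation):

```python
import random

def split_libffm(path, train_path, valid_path, train_ratio=0.8, seed=1):
    """Randomly split a LIBFFM-format file into training/validation files.

    Assumes one sample per line, as in LIBFFM's text format.
    """
    with open(path) as f:
        lines = f.readlines()
    random.Random(seed).shuffle(lines)  # deterministic shuffle for reproducibility
    n_train = int(len(lines) * train_ratio)
    with open(train_path, "w") as f:
        f.writelines(lines[:n_train])
    with open(valid_path, "w") as f:
        f.writelines(lines[n_train:])
    return n_train, len(lines) - n_train
```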
With the (nearly identical) configurations shown below, the FFM models trained by LIBFFM and Hivemall yielded very similar training loss and accuracy.
**LIBFFM**:
```
$ ./ffm-train -k 4 -t 15 -l 0.00002 -r 0.2 -s 10 ../tr.sp model
iter tr_logloss tr_time
1 1.04980 0.0
2 0.53771 0.0
3 0.50963 0.0
4 0.48980 0.1
5 0.47469 0.1
6 0.46304 0.1
7 0.45289 0.1
8 0.44400 0.1
9 0.43653 0.1
10 0.42947 0.1
11 0.42330 0.1
12 0.41727 0.1
13 0.41130 0.1
14 0.40558 0.1
15 0.40036 0.1
```
> LogLoss on validation set `va.sp`: 0.47237
**Hivemall**:
```
$ hive --hiveconf hive.root.logger=INFO,console
hive> INSERT OVERWRITE TABLE criteo.ffm_model
> SELECT
> train_ffm(features, label, '-init_v random -max_init_value 1.0 -classification -iterations 15 -factors 4 -eta 0.2 -l2norm -optimizer sgd -lambda 0.00002 -cv_rate 0.0 -disable_wi')
> FROM (
> SELECT
> features, label
> FROM
> criteo.train_vectorized
> CLUSTER BY rand(1)
> ) t
> ;
Record training examples to a file: /var/folders/rg/6mhvj7h567x_ys7brmf2bb6w0000gn/T/hivemall_fm6211397472147242886.sgmt
Iteration #2 | average loss=0.5316043797079182, current cumulative loss=843.6561505964662, previous cumulative loss=1214.5909560888044, change rate=0.30539895232450376, #trainingExamples=1587
Iteration #3 | average loss=0.5065999656968238, current cumulative loss=803.9741455608594, previous cumulative loss=843.6561505964662, change rate=0.04703575622313853, #trainingExamples=1587
Iteration #4 | average loss=0.49634490612175397, current cumulative loss=787.6993660152235, previous cumulative loss=803.9741455608594, change rate=0.0202429140731664, #trainingExamples=1587
Iteration #5 | average loss=0.48804954980765963, current cumulative loss=774.5346355447558, previous cumulative loss=787.6993660152235, change rate=0.0167128869698916, #trainingExamples=1587
Iteration #6 | average loss=0.48072518575956447, current cumulative loss=762.9108698004288, previous cumulative loss=774.5346355447558, change rate=0.015007418920848658, #trainingExamples=1587
Iteration #7 | average loss=0.47402279755334875, current cumulative loss=752.2741797171644, previous cumulative loss=762.9108698004288, change rate=0.013942244768444403, #trainingExamples=1587
Iteration #8 | average loss=0.4677507471836629, current cumulative loss=742.320435780473, previous cumulative loss=752.2741797171644, change rate=0.013231537390308698, #trainingExamples=1587
Iteration #9 | average loss=0.4618142861358177, current cumulative loss=732.8992720975427, previous cumulative loss=742.320435780473, change rate=0.012691505216375798, #trainingExamples=1587
Iteration #10 | average loss=0.4561878517855827, current cumulative loss=723.9701207837197, previous cumulative loss=732.8992720975427, change rate=0.012183326759580433, #trainingExamples=1587
Iteration #11 | average loss=0.45087834343992406, current cumulative loss=715.5439310391595, previous cumulative loss=723.9701207837197, change rate=0.01163886395675921, #trainingExamples=1587
Iteration #12 | average loss=0.4458864402438874, current cumulative loss=707.6217806670493, previous cumulative loss=715.5439310391595, change rate=0.011071508021324606, #trainingExamples=1587
Iteration #13 | average loss=0.44118468270053807, current cumulative loss=700.1600914457539, previous cumulative loss=707.6217806670493, change rate=0.010544742156271002, #trainingExamples=1587
Iteration #14 | average loss=0.4367191822212713, current cumulative loss=693.0733421851576, previous cumulative loss=700.1600914457539, change rate=0.01012161268141256, #trainingExamples=1587
Iteration #15 | average loss=0.4324248854220929, current cumulative loss=686.2582931648615, previous cumulative loss=693.0733421851576, change rate=0.009833084906727563, #trainingExamples=1587
Performed 15 iterations of 1,587 training examples on memory (thus 23,805 training updates in total)
```
> LogLoss on the same validation set: 0.47604112308042346
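For reference, the validation LogLoss values compared above are the standard mean binary cross-entropy; a minimal Python version (assuming labels in {0, 1} and predicted click probabilities) looks like:

```python
import math

def log_loss(labels, probs, eps=1e-15):
    """Mean binary cross-entropy over (label, probability) pairs.

    Probabilities are clipped to [eps, 1 - eps] to avoid log(0).
    """
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)
```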
Note that, since the `-l2norm` option was used for training, the validation samples should also be L2-normalized, e.g., `feature_pairs(l2_normalize(t1.features), '-ffm')`.
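Conceptually, L2 normalization rescales a sample's feature values so that their L2 norm is 1; a minimal sketch (representing features as (name, value) pairs, which is an assumption about the representation, not Hivemall's internal one) is:

```python
import math

def l2_normalize(features):
    """Scale feature values so the vector has unit L2 norm.

    features: list of (feature_name, value) pairs for one sample.
    """
    norm = math.sqrt(sum(v * v for _, v in features))
    if norm == 0.0:
        return features  # all-zero sample: nothing to scale
    return [(f, v / norm) for f, v in features]
```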
While the choice of hyper-parameters and optimizer (SGD/FTRL/AdaGrad) affects the accuracy to some degree, I have noticed that `-disable_wi` can be a more important factor on this data. If we use the linear terms to train the FFM model, LogLoss on `va.sp` increases significantly, to `1.5227099483928919`.
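The `-disable_wi` option drops the per-feature linear terms from the FFM score, keeping only the bias and the field-aware pairwise interactions. A hedged sketch of the scoring function (data layout and names are hypothetical, chosen only to illustrate what the flag toggles):

```python
import itertools

def ffm_predict(x, w0, w, V, use_wi=True):
    """FFM raw score: bias + (optional) linear terms + field-aware interactions.

    x:  list of (field, feature, value) triples for one sample
    w:  dict feature -> linear weight w_i
    V:  dict (feature, other_field) -> latent factor vector (list of floats)
    """
    score = w0
    if use_wi:  # skipped when the model is trained with -disable_wi
        score += sum(w.get(feat, 0.0) * val for _, feat, val in x)
    # Pairwise terms: each feature uses the factor indexed by the other's field.
    for (f1, j1, v1), (f2, j2, v2) in itertools.combinations(x, 2):
        v_a = V.get((j1, f2))
        v_b = V.get((j2, f1))
        if v_a and v_b:
            score += sum(a * b for a, b in zip(v_a, v_b)) * v1 * v2
    return score
```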
I'm still not sure whether this result is expected or caused by a bug. Let me double-check the implementation.
---