Posted to issues@hivemall.apache.org by takuti <gi...@git.apache.org> on 2018/05/17 05:41:30 UTC
[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...
GitHub user takuti opened a pull request:
https://github.com/apache/incubator-hivemall/pull/149
[WIP][HIVEMALL-201] Evaluate, fix and document FFM
## What changes were proposed in this pull request?
- Evaluate FFM so Hivemall replicates comparable accuracy to [LIBFFM](https://github.com/guestwalk/libffm).
- Fix its implementation if needed
- Document how to use FFM
## What type of PR is it?
Bug Fix, Improvement, Documentation
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-201
## How was this patch tested?
Unit tests and manual tests
## How to use this feature?
(To be documented)
## Checklist
- [x] Did you apply source code formatter, i.e., `mvn formatter:format`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/takuti/incubator-hivemall HIVEMALL-201
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-hivemall/pull/149.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #149
----
commit c67744cbe60711a9cb9da5c55a4157f9d107dbf3
Author: Takuya Kitazawa <k....@...>
Date: 2018-05-16T08:39:32Z
Use pre-defined constants in option description
commit d77f0161220327a4e2ef12f368f329dc56c1c941
Author: Takuya Kitazawa <k....@...>
Date: 2018-05-16T08:40:48Z
Fix mismatch between opts.addOption and cl.getOptionValue
commit c8e374b8bfb87a0ed420604aca2340c974770f50
Author: Takuya Kitazawa <k....@...>
Date: 2018-05-16T08:41:34Z
Support FFM feature format in `l1_normalize` and `l2_normalize`
----
---
[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:
https://github.com/apache/incubator-hivemall/pull/149
@takuti so then, it's better to enable l2_norm by default and add `-disable_l2norm` to disable L2 normalization. My concern is that L2 normalization performed worse for small datasets with an adequate learning rate `[0.1,1.0]`.
FieldAwareFactorizationMachineUDTFTest contains several tests. It's better to confirm that accuracy does not degrade with the new default options, i.e., with L2 normalization enabled.
---
[GitHub] incubator-hivemall pull request #149: [HIVEMALL-201] Evaluate, fix and docum...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/149#discussion_r205390805
--- Diff: core/src/main/java/hivemall/fm/FactorizationMachineModel.java ---
@@ -399,9 +399,8 @@ public void initRandom(int factor, long seed) {
protected static final void uniformFill(final float[] a, final Random rand,
final float maxInitValue) {
final int len = a.length;
- final float basev = maxInitValue / len;
for (int i = 0; i < len; i++) {
- float v = rand.nextFloat() * basev;
+ float v = rand.nextFloat() * maxInitValue;
--- End diff --
While this modified `random` initialization is not used for classification (only for regression), your evaluation covers only classification.
Thus, it's doubtful that this change contributed to improving accuracy.
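The effect of the diff above can be illustrated with a minimal standalone sketch (the method names mirror the diff, but this is not the actual Hivemall code):

```java
import java.util.Random;

public class UniformFillSketch {
    // Before the change: each element is drawn from [0, maxInitValue / len),
    // so the initialization range shrinks as the array grows.
    static void uniformFillOld(final float[] a, final Random rand, final float maxInitValue) {
        final float basev = maxInitValue / a.length;
        for (int i = 0; i < a.length; i++) {
            a[i] = rand.nextFloat() * basev;
        }
    }

    // After the change: each element is drawn from [0, maxInitValue),
    // independent of the array length.
    static void uniformFillNew(final float[] a, final Random rand, final float maxInitValue) {
        for (int i = 0; i < a.length; i++) {
            a[i] = rand.nextFloat() * maxInitValue;
        }
    }

    public static void main(String[] args) {
        final float[] a = new float[10];
        uniformFillOld(a, new Random(42L), 1.0f);
        // Here every value lies in [0, 0.1), since maxInitValue / len = 0.1
        uniformFillNew(a, new Random(42L), 1.0f);
        // Values may now lie anywhere in [0, 1.0)
    }
}
```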
---
[GitHub] incubator-hivemall issue #149: [HIVEMALL-201] Evaluate, fix and document FFM
Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:
https://github.com/apache/incubator-hivemall/pull/149
We need to keep the default hyperparameters of FM as they are for backward compatibility. I'll take care of it on merging.
---
[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...
Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/149#discussion_r191317118
--- Diff: core/src/main/java/hivemall/fm/FactorizationMachineModel.java ---
@@ -92,6 +92,14 @@ protected float getW(int i) {
protected abstract void setW(@Nonnull Feature x, float nextWi);
+ protected void setW(int i, float nextWi) {
--- End diff --
ad24f38103b3951c6266382c048e0f4514dacd1e
---
[GitHub] incubator-hivemall pull request #149: [HIVEMALL-201] Evaluate, fix and docum...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/149#discussion_r200604967
--- Diff: core/src/main/java/hivemall/fm/FactorizationMachineUDTF.java ---
@@ -351,19 +370,29 @@ private static void writeBuffer(@Nonnull ByteBuffer srcBuf, @Nonnull NioStateful
srcBuf.clear();
}
- public void train(@Nonnull final Feature[] x, final double y,
- final boolean adaptiveRegularization) throws HiveException {
+ protected void checkInputVector(@Nonnull final Feature[] x) throws HiveException {
_model.check(x);
+ }
+
+ protected void processValidationSample(@Nonnull final Feature[] x, final double y)
+ throws HiveException {
+ if (_adaptiveRegularization) {
+ trainLambda(x, y); // adaptive regularization
+ }
+ if (_earlyStopping) {
--- End diff --
It's better to perform earlyStopping before adaptiveRegularization.
---
[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/149#discussion_r191645232
--- Diff: core/src/test/java/hivemall/fm/FieldAwareFactorizationMachineUDTFTest.java ---
@@ -256,6 +256,19 @@ public void testEarlyStopping() throws HiveException, IOException {
cumulativeLoss > udtf._validationState.getCumulativeLoss());
}
+ @Test(expected = IllegalArgumentException.class)
+ public void testUnsupportedAdaptiveRegularizationOption() throws Exception {
+ TestUtils.testGenericUDTFSerialization(FieldAwareFactorizationMachineUDTF.class,
+ new ObjectInspector[] {
+ ObjectInspectorFactory.getStandardListObjectInspector(
+ PrimitiveObjectInspectorFactory.javaStringObjectInspector),
+ PrimitiveObjectInspectorFactory.javaDoubleObjectInspector,
+ ObjectInspectorUtils.getConstantObjectInspector(
+ PrimitiveObjectInspectorFactory.javaStringObjectInspector,
+ "-seed 43 -adaptive_regularization")},
+ new Object[][] {{Arrays.asList("0:1:-2", "1:2:-1"), 1.0}});
--- End diff --
It's better to compare accuracy against the default regularization. In general, adaptive regularization should outperform the default.
---
[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:
https://github.com/apache/incubator-hivemall/pull/149
```
iter tr_logloss va_logloss
1 0.49738 0.48776
2 0.47383 0.47995
3 0.46366 0.47480
4 0.45561 0.47231
5 0.44810 0.47034
6 0.44037 0.47003
7 0.43239 0.46952
8 0.42362 0.46999 <- ffm stops once va_logloss increases, but va_logloss might decrease in the next iteration
9 0.41394 0.47088
```
In the 8th iteration, it becomes `ready to stop once va_logloss increases`.
If va_logloss decreases in the 9th iteration, then continue iterating (unset the ready-to-finish flag).
If va_logloss increases in the 9th iteration, then emit the current model at the 9th iteration.
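The "stop only after two consecutive increases" policy described above can be sketched as a tiny state machine (a hypothetical simplification for illustration, not the actual UDTF code):

```java
// Minimal sketch of the wait-one-iteration early-stopping policy:
// once va_logloss increases we become "ready to stop"; if the next
// iteration decreases the loss we continue, otherwise we finish.
public class EarlyStoppingSketch {
    private double prevLoss = Double.POSITIVE_INFINITY;
    private boolean readyToStop = false;

    /** Returns true when training should stop after this iteration. */
    public boolean update(double vaLogloss) {
        if (vaLogloss > prevLoss) {
            if (readyToStop) {
                return true; // increased twice in a row: stop
            }
            readyToStop = true; // first increase: wait one more iteration
        } else {
            readyToStop = false; // loss decreased again: keep going
        }
        prevLoss = vaLogloss;
        return false;
    }

    public static void main(String[] args) {
        EarlyStoppingSketch es = new EarlyStoppingSketch();
        double[] losses = {0.46952, 0.46999, 0.47088}; // iterations 7-9 above
        boolean stopped = false;
        for (double l : losses) {
            stopped = es.update(l);
        }
        System.out.println(stopped); // two consecutive increases -> true
    }
}
```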
---
[GitHub] incubator-hivemall issue #149: [HIVEMALL-201] Evaluate, fix and document FFM
Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:
https://github.com/apache/incubator-hivemall/pull/149
@takuti with the modified default hyperparameters of FM, the performance of FM gets worse.
Before
> 0.6736798239047873 (mae) 0.858938110314545 (rmse)
After
> 0.6837803085633278 (mae) 0.876690912076831 (rmse)
http://hivemall.incubator.apache.org/userguide/recommend/movielens_fm.html
---
[GitHub] incubator-hivemall pull request #149: [HIVEMALL-201] Evaluate, fix and docum...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/149#discussion_r200611442
--- Diff: core/src/main/java/hivemall/fm/FieldAwareFactorizationMachineModel.java ---
@@ -51,11 +50,6 @@
public FieldAwareFactorizationMachineModel(@Nonnull FFMHyperParameters params) {
super(params);
this._params = params;
- if (params.useAdaGrad) {
- this._eta0 = 1.0f;
--- End diff --
It's better to use a large default eta0 for AdaGrad.
---
[GitHub] incubator-hivemall pull request #149: [HIVEMALL-201] Evaluate, fix and docum...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/incubator-hivemall/pull/149
---
[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...
Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/149#discussion_r191315508
--- Diff: core/src/main/java/hivemall/fm/FactorizationMachineUDTF.java ---
@@ -379,23 +379,28 @@ protected void checkInputVector(@Nonnull final Feature[] x) throws HiveException
_model.check(x);
}
+ protected void processValidationSample(@Nonnull final Feature[] x, final double y)
+ throws HiveException {
+ if (_adaptiveRegularization) {
+ trainLambda(x, y); // adaptive regularization
--- End diff --
373144d5151d1a57c3c13059ac70cc2e2908dc44
---
[GitHub] incubator-hivemall issue #149: [HIVEMALL-201] Evaluate, fix and document FFM
Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on the issue:
https://github.com/apache/incubator-hivemall/pull/149
@myui Documentation and system tests on my laptop/EMR have been completed, and I'm now ready for review. Could you take a deeper look at the updates for the merge?
---
[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/149#discussion_r191309443
--- Diff: core/src/main/java/hivemall/fm/FactorizationMachineModel.java ---
@@ -92,6 +92,14 @@ protected float getW(int i) {
protected abstract void setW(@Nonnull Feature x, float nextWi);
+ protected void setW(int i, float nextWi) {
--- End diff --
No need to have `protected void setW(int i, float nextWi)` and `protected void setW(@Nonnull String j, float nextWi)` in FactorizationMachineModel.
---
[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...
Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/149#discussion_r191694422
--- Diff: core/src/test/java/hivemall/fm/FieldAwareFactorizationMachineUDTFTest.java ---
@@ -256,6 +256,19 @@ public void testEarlyStopping() throws HiveException, IOException {
cumulativeLoss > udtf._validationState.getCumulativeLoss());
}
+ @Test(expected = IllegalArgumentException.class)
+ public void testUnsupportedAdaptiveRegularizationOption() throws Exception {
+ TestUtils.testGenericUDTFSerialization(FieldAwareFactorizationMachineUDTF.class,
+ new ObjectInspector[] {
+ ObjectInspectorFactory.getStandardListObjectInspector(
+ PrimitiveObjectInspectorFactory.javaStringObjectInspector),
+ PrimitiveObjectInspectorFactory.javaDoubleObjectInspector,
+ ObjectInspectorUtils.getConstantObjectInspector(
+ PrimitiveObjectInspectorFactory.javaStringObjectInspector,
+ "-seed 43 -adaptive_regularization")},
+ new Object[][] {{Arrays.asList("0:1:-2", "1:2:-1"), 1.0}});
--- End diff --
690a20032d6810a05e0b0ebb73e284b8d6ea7cb7
---
[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/149#discussion_r190842171
--- Diff: core/src/main/java/hivemall/fm/FactorizationMachineUDTF.java ---
@@ -563,6 +580,10 @@ protected void runTrainingIteration(int iterations) throws HiveException {
inputBuf.flip();
for (int iter = 2; iter <= iterations; iter++) {
+ if (earlyStopValidation) {
--- End diff --
It's better to avoid many `if (earlyStopValidation) {` checks.
`_validateState` can always be non-null; then `earlyStopValidation && _validateState.isLossIncreased()` simply never becomes true when early stopping is disabled.
---
[GitHub] incubator-hivemall issue #149: [HIVEMALL-201] Evaluate, fix and document FFM
Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:
https://github.com/apache/incubator-hivemall/pull/149
@takuti Sure.
---
[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:
https://github.com/apache/incubator-hivemall/pull/149
@takuti I advise checking 2-3 updates to investigate how the gradient updates differ.
---
[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:
https://github.com/apache/incubator-hivemall/pull/149
This kind of behavior can often happen, and LIBFFM's early stopping strategy is too aggressive.
```
7 0.43239 0.46952
8 0.42362 0.46999
9 0.41394 0.45088
```
---
[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/149#discussion_r190843344
--- Diff: core/src/main/java/hivemall/fm/FactorizationMachineUDTF.java ---
@@ -352,9 +352,13 @@ private static void writeBuffer(@Nonnull ByteBuffer srcBuf, @Nonnull NioStateful
srcBuf.clear();
}
+ protected void checkInputVector(@Nonnull final Feature[] x) throws HiveException {
+ _model.check(x);
+ }
+
public void train(@Nonnull final Feature[] x, final double y,
final boolean adaptiveRegularization) throws HiveException {
- _model.check(x);
+ checkInputVector(x);
try {
if (adaptiveRegularization) {
--- End diff --
I think there is no need to share `train` if `adaptiveRegularization` is always off for FFM and `early_stopping` is always off for FM. The logic in `train` becomes complex.
---
[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:
https://github.com/apache/incubator-hivemall/pull/149
@takuti Thank you for the detailed verification. Let's disable the linear term by default.
Remove `-disable_wi`, and use `-enable_wi` (alias `-linear_term`) to enable the linear term.
I'm not sure `-l2norm` should be enabled by default. What happens without `-l2norm`?
---
[GitHub] incubator-hivemall pull request #149: [HIVEMALL-201] Evaluate, fix and docum...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/149#discussion_r200605210
--- Diff: core/src/main/java/hivemall/fm/FactorizationMachineUDTF.java ---
@@ -351,19 +370,29 @@ private static void writeBuffer(@Nonnull ByteBuffer srcBuf, @Nonnull NioStateful
srcBuf.clear();
}
- public void train(@Nonnull final Feature[] x, final double y,
- final boolean adaptiveRegularization) throws HiveException {
+ protected void checkInputVector(@Nonnull final Feature[] x) throws HiveException {
_model.check(x);
+ }
+
+ protected void processValidationSample(@Nonnull final Feature[] x, final double y)
+ throws HiveException {
+ if (_adaptiveRegularization) {
+ trainLambda(x, y); // adaptive regularization
+ }
+ if (_earlyStopping) {
+ double p = _model.predict(x);
+ double loss = _lossFunction.loss(p, y);
+ _validationState.incrLoss(loss);
+ }
+ }
+
+ public void train(@Nonnull final Feature[] x, final double y, final boolean validation)
+ throws HiveException {
+ checkInputVector(x);
--- End diff --
Avoid too many virtual method calls.
`_model.check(x);` is enough for both FM and FFM.
---
[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/149#discussion_r191309836
--- Diff: core/src/main/java/hivemall/fm/FactorizationMachineModel.java ---
@@ -92,6 +92,14 @@ protected float getW(int i) {
protected abstract void setW(@Nonnull Feature x, float nextWi);
+ protected void setW(int i, float nextWi) {
--- End diff --
`setW(int i, float nextWi)` is no longer used once we avoid caching in early stopping.
---
[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...
Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on the issue:
https://github.com/apache/incubator-hivemall/pull/149
Note: I've extended the LIBFFM code so it uses linear terms: https://github.com/takuti/criteo-ffm/commit/9aca61d93ed8f583025729206ed0dbfd54806a44 However, I cannot observe a significant difference between the LogLoss achieved with and without linear terms.
---
[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:
https://github.com/apache/incubator-hivemall/pull/149
BTW, it might be better to implement `early stopping` using validation data.
https://github.com/guestwalk/libffm
We can use an approach similar to the `_validationRatio` used in `FactorizationMachineUDTF` instead of preparing a validation dataset.
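The `_validationRatio`-style holdout mentioned above can be sketched like this (a hypothetical illustration of the idea, not the actual UDTF logic):

```java
import java.util.Random;

// Holds out each incoming training sample for validation with a fixed
// probability, so no separate validation dataset has to be prepared.
public class ValidationSplitSketch {
    private final double validationRatio;
    private final Random rand;

    ValidationSplitSketch(double validationRatio, long seed) {
        this.validationRatio = validationRatio;
        this.rand = new Random(seed);
    }

    /** Returns true when the incoming sample should be held out for validation. */
    boolean isValidationSample() {
        return rand.nextDouble() < validationRatio;
    }

    public static void main(String[] args) {
        ValidationSplitSketch split = new ValidationSplitSketch(0.2d, 31L);
        int held = 0;
        final int n = 10000;
        for (int i = 0; i < n; i++) {
            if (split.isValidationSample()) {
                held++;
            }
        }
        // Roughly 20% of the samples end up in the validation set
        System.out.println(held + " of " + n + " held out");
    }
}
```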
---
[GitHub] incubator-hivemall pull request #149: [HIVEMALL-201] Evaluate, fix and docum...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/149#discussion_r200093417
--- Diff: core/src/main/java/hivemall/fm/FieldAwareFactorizationMachineModel.java ---
@@ -259,9 +255,9 @@ protected final float eta(@Nonnull final Entry theta, final long t, final float
protected final float eta(@Nonnull final Entry theta, @Nonnegative final int f, final long t,
final float grad) {
if (_useAdaGrad) {
- double gg = theta.getSumOfSquaredGradients(f);
--- End diff --
@takuti This behavior (the one used in libffm) is wrong in a strict sense, and the previous code is much better, because the initial eta should equal `eta0`, whereas this implementation depends on the initial gradient.
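The distinction can be sketched as follows. The `1.0d + gg` denominator for the previous behavior is an assumption for illustration (the invariant being that eta equals `eta0` before any gradients have accumulated), while the LIBFFM-style variant divides by the accumulated squared gradients directly:

```java
public class AdaGradEtaSketch {
    // Assumed previous behavior: with gg = 0 (no updates yet),
    // the learning rate is exactly eta0.
    static float etaPrevious(final float eta0, final double gg) {
        return (float) (eta0 / Math.sqrt(1.0d + gg));
    }

    // LIBFFM-style behavior: gg already contains the first gradient when
    // eta is computed, so the effective initial rate depends on that gradient.
    static float etaLibffmStyle(final float eta0, final double gg) {
        return (float) (eta0 / Math.sqrt(gg));
    }

    public static void main(String[] args) {
        final float eta0 = 0.1f;
        // Before any updates, the previous scheme yields exactly eta0
        System.out.println(etaPrevious(eta0, 0.0d));
        // The LIBFFM-style one depends on the first gradient g
        final double g = 0.5d;
        System.out.println(etaLibffmStyle(eta0, g * g)); // 0.1 / 0.5
    }
}
```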
---
[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:
https://github.com/apache/incubator-hivemall/pull/149
@takuti
The linear term is not used in the LIBFFM implementation. It's better to research other FFM implementations as well.
https://github.com/chenhuang-learn/ffm/blob/master/ffm/src/ffm/FFMModel.java
https://github.com/superclocks/ffm/blob/master/libffm-ftrl-1.13/ffm-train.cpp
https://github.com/chenhuang-learn/ffm
https://github.com/gaterslebenchen/JLibFFM/
https://github.com/yuantiku/ytk-learn
https://github.com/RTBHOUSE/cuda-ffm/ (modified version of FFM)
The default initial learning rate also largely affects convergence.
For instance-wise normalization, it's better to follow the discussion in
https://markmail.org/message/jwtr5xygfutl55oz
It performs well only when all features are categorical. For his dataset, instance-wise L2 normalization performed very badly...
https://gist.github.com/myui/aaeef548a17eb90c4e88f824c3ca1bcd
---
[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:
https://github.com/apache/incubator-hivemall/pull/149
It might be better to reconsider `eta0` when enabling `l2norm` by default and enlarging `max_init_size`. In my experience with FM, the initial random size should be small when the average feature dimension is large (gradients will be large).
I think `1.0` is too aggressive as the default, though. `0.2` or `0.5`? It's better to research other implementations.
---
[GitHub] incubator-hivemall issue #149: [HIVEMALL-201] Evaluate, fix and document FFM
Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:
https://github.com/apache/incubator-hivemall/pull/149
revising in https://github.com/myui/incubator-hivemall/commits/HIVEMALL-201-2
---
[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/149#discussion_r191308798
--- Diff: core/src/main/java/hivemall/fm/FMArrayModel.java ---
@@ -80,6 +80,11 @@ public float getW(@Nonnull final Feature x) {
@Override
protected void setW(@Nonnull Feature x, float nextWi) {
int i = x.getFeatureIndex();
+ setW(i, nextWi);
--- End diff --
It's better to avoid the extra method call.
---
[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...
Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on the issue:
https://github.com/apache/incubator-hivemall/pull/149
Makes sense as a compromise in terms of memory consumption.
I'll note in the documentation that our `-early_stopping` option does not return the best model overall; users may expect the option to return the best model achieved at the 7th iteration, but our UDF does not behave that way, as discussed above.
---
[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...
Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/149#discussion_r191145436
--- Diff: core/src/main/java/hivemall/fm/FactorizationMachineUDTF.java ---
@@ -563,6 +580,10 @@ protected void runTrainingIteration(int iterations) throws HiveException {
inputBuf.flip();
for (int iter = 2; iter <= iterations; iter++) {
+ if (earlyStopValidation) {
--- End diff --
Checking whether `_validateState` is `NULL` allows us to confirm that at least one validation sample exists and the early stopping option is enabled. Thanks to this implementation, we can avoid unnecessary calls to `cacheCurrentModel`, which consumes extra memory to hold the best model parameters.
---
[GitHub] incubator-hivemall issue #149: [HIVEMALL-201] Evaluate, fix and document FFM
Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:
https://github.com/apache/incubator-hivemall/pull/149
> -lambda 0.0001 (default), -init_v adjusted_random
0.6756640217829124 (mae) 0.8644404496920104 (rmse)
> -lambda 0.001, -init_v adjusted_random
0.6749224090640931 (mae) 0.8642914100412997 (rmse)
> -lambda 0.002, -init_v adjusted_random
0.6729486759257253 (mae) 0.862249033512779 (rmse)
> -lambda 0.01, -init_v adjusted_random
0.6728088660666263 (mae) 0.8568219312625348 (rmse)
• libfm
```
eta=0.1
init_stdev=0.1
reg0 = 0.0;
regw = 0.0;
regv = 0.0;
```
https://github.com/srendle/libfm/blob/4ba0e0d5646da5d00701d853d19fbbe9b236cfd7/src/libfm/libfm.cpp#L87
https://github.com/srendle/libfm/blob/30b9c799c41d043f31565cbf827bf41d0dc3e2ab/src/fm_core/fm_model.h#L73
• libffm
```
eta = 0.1; // learning rate
lambda = 0.00002; // regularization parameter
nr_iters = 15;
k = 4; // number of latent factors
```
https://github.com/srendle/libfm/blob/4ba0e0d5646da5d00701d853d19fbbe9b236cfd7/src/libfm/libfm.cpp#L84
---
[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...
Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/149#discussion_r191149655
--- Diff: core/src/main/java/hivemall/fm/FactorizationMachineUDTF.java ---
@@ -352,9 +352,13 @@ private static void writeBuffer(@Nonnull ByteBuffer srcBuf, @Nonnull NioStateful
srcBuf.clear();
}
+ protected void checkInputVector(@Nonnull final Feature[] x) throws HiveException {
+ _model.check(x);
+ }
+
public void train(@Nonnull final Feature[] x, final double y,
final boolean adaptiveRegularization) throws HiveException {
- _model.check(x);
+ checkInputVector(x);
try {
if (adaptiveRegularization) {
--- End diff --
Well...the difference is only 3 lines:
```java
if (_adaptiveRegularization) {
trainLambda(x, y); // adaptive regularization
}
```
Since FFM explicitly inherits the FM code and shares many options, I just tried to remove duplicated code between them as much as possible. Both FM and FFM use some training samples for validation in the same manner; the code should clearly show that fact.
An alternative idea is like this: 5fbcc017e9557a10275811888437afdf6c4a0ad7
---
[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...
Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on the issue:
https://github.com/apache/incubator-hivemall/pull/149
Evaluation has been conducted at [takuti/criteo-ffm](https://github.com/takuti/criteo-ffm). See the repository for details.
As an example, I have used the tiny data provided at [guestwalk/kaggle-2014-criteo](https://github.com/guestwalk/kaggle-2014-criteo), which is already preprocessed and converted into the LIBFFM format:
- Split the 2,000 samples in `train.tiny.csv` into:
- 1,587 training samples `tr.sp`
- 412 validation samples `va.sp`
As a result, the FFM models created by LIBFFM and Hivemall with the following (nearly identical) configurations showed very similar training loss and accuracy.
**LIBFFM**:
```
$ ./ffm-train -k 4 -t 15 -l 0.00002 -r 0.2 -s 10 ../tr.sp model
iter tr_logloss tr_time
1 1.04980 0.0
2 0.53771 0.0
3 0.50963 0.0
4 0.48980 0.1
5 0.47469 0.1
6 0.46304 0.1
7 0.45289 0.1
8 0.44400 0.1
9 0.43653 0.1
10 0.42947 0.1
11 0.42330 0.1
12 0.41727 0.1
13 0.41130 0.1
14 0.40558 0.1
15 0.40036 0.1
```
> LogLoss on validation set `va.sp`: 0.47237
**Hivemall**:
```
$ hive --hiveconf hive.root.logger=INFO,console
hive> INSERT OVERWRITE TABLE criteo.ffm_model
> SELECT
> train_ffm(features, label, '-init_v random -max_init_value 1.0 -classification -iterations 15 -factors 4 -eta 0.2 -l2norm -optimizer sgd -lambda 0.00002 -cv_rate 0.0 -disable_wi')
> FROM (
> SELECT
> features, label
> FROM
> criteo.train_vectorized
> CLUSTER BY rand(1)
> ) t
> ;
Record training examples to a file: /var/folders/rg/6mhvj7h567x_ys7brmf2bb6w0000gn/T/hivemall_fm6211397472147242886.sgmt
Iteration #2 | average loss=0.5316043797079182, current cumulative loss=843.6561505964662, previous cumulative loss=1214.5909560888044, change rate=0.30539895232450376, #trainingExamples=1587
Iteration #3 | average loss=0.5065999656968238, current cumulative loss=803.9741455608594, previous cumulative loss=843.6561505964662, change rate=0.04703575622313853, #trainingExamples=1587
Iteration #4 | average loss=0.49634490612175397, current cumulative loss=787.6993660152235, previous cumulative loss=803.9741455608594, change rate=0.0202429140731664, #trainingExamples=1587
Iteration #5 | average loss=0.48804954980765963, current cumulative loss=774.5346355447558, previous cumulative loss=787.6993660152235, change rate=0.0167128869698916, #trainingExamples=1587
Iteration #6 | average loss=0.48072518575956447, current cumulative loss=762.9108698004288, previous cumulative loss=774.5346355447558, change rate=0.015007418920848658, #trainingExamples=1587
Iteration #7 | average loss=0.47402279755334875, current cumulative loss=752.2741797171644, previous cumulative loss=762.9108698004288, change rate=0.013942244768444403, #trainingExamples=1587
Iteration #8 | average loss=0.4677507471836629, current cumulative loss=742.320435780473, previous cumulative loss=752.2741797171644, change rate=0.013231537390308698, #trainingExamples=1587
Iteration #9 | average loss=0.4618142861358177, current cumulative loss=732.8992720975427, previous cumulative loss=742.320435780473, change rate=0.012691505216375798, #trainingExamples=1587
Iteration #10 | average loss=0.4561878517855827, current cumulative loss=723.9701207837197, previous cumulative loss=732.8992720975427, change rate=0.012183326759580433, #trainingExamples=1587
Iteration #11 | average loss=0.45087834343992406, current cumulative loss=715.5439310391595, previous cumulative loss=723.9701207837197, change rate=0.01163886395675921, #trainingExamples=1587
Iteration #12 | average loss=0.4458864402438874, current cumulative loss=707.6217806670493, previous cumulative loss=715.5439310391595, change rate=0.011071508021324606, #trainingExamples=1587
Iteration #13 | average loss=0.44118468270053807, current cumulative loss=700.1600914457539, previous cumulative loss=707.6217806670493, change rate=0.010544742156271002, #trainingExamples=1587
Iteration #14 | average loss=0.4367191822212713, current cumulative loss=693.0733421851576, previous cumulative loss=700.1600914457539, change rate=0.01012161268141256, #trainingExamples=1587
Iteration #15 | average loss=0.4324248854220929, current cumulative loss=686.2582931648615, previous cumulative loss=693.0733421851576, change rate=0.009833084906727563, #trainingExamples=1587
Performed 15 iterations of 1,587 training examples on memory (thus 23,805 training updates in total)
```
> LogLoss on the same validation set: 0.47604112308042346
Note that, since we used the `-l2norm` option for training, the validation samples should also be L2 normalized, as in: `feature_pairs(l2_normalize(t1.features), '-ffm')`
While the choice of hyper-parameters and optimizer (SGD/FTRL/AdaGrad) affects the accuracy to some degree, I have noticed `-disable_wi` can be a more important factor on this data. If we use the linear terms to train the FFM model, LogLoss on `va.sp` increases significantly, to `1.5227099483928919`.
I'm still not sure whether this result is natural or caused by a bug. Let me double-check the implementation.
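For reference, instance-wise L2 normalization of FFM-formatted `field:index:value` features can be sketched like this (a hypothetical helper for illustration, not Hivemall's `l2_normalize` UDF):

```java
public class L2NormalizeSketch {
    /** Rescales the values of "field:index:value" features so their L2 norm is 1. */
    static String[] l2Normalize(final String[] features) {
        double sumSq = 0.0d;
        for (String f : features) {
            final String[] parts = f.split(":");
            final double v = Double.parseDouble(parts[2]);
            sumSq += v * v;
        }
        final double norm = Math.sqrt(sumSq);
        final String[] result = new String[features.length];
        for (int i = 0; i < features.length; i++) {
            final String[] parts = features[i].split(":");
            final double v = Double.parseDouble(parts[2]) / norm;
            // Keep field and index untouched; only the value is rescaled
            result[i] = parts[0] + ":" + parts[1] + ":" + v;
        }
        return result;
    }

    public static void main(String[] args) {
        // 3-4-5 triangle: the norm is 5, so values become 0.6 and 0.8
        for (String f : l2Normalize(new String[] {"0:1:3", "1:2:4"})) {
            System.out.println(f);
        }
    }
}
```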
---
[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...
Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on the issue:
https://github.com/apache/incubator-hivemall/pull/149
### With linear terms
#### Hivemall
```sql
INSERT OVERWRITE TABLE criteo.ffm_model
SELECT
train_ffm(features, label, '-init_v random -max_init_value 0.5 -classification -iterations 15 -factors 4 -eta 0.2 -l2norm -optimizer adagrad -lambda 0.00002 -cv_rate 0.0')
FROM (
SELECT
features, label
FROM
criteo.train_vectorized
CLUSTER BY rand(1)
) t
;
```
```
Iteration #2 | average loss=0.474651712453725, current cumulative loss=753.2722676640616, previous cumulative loss=990.2550021169766, change rate=0.23931485722999737, #trainingExamples=1587
Iteration #3 | average loss=0.4499051385165006, current cumulative loss=713.9994548256865, previous cumulative loss=753.2722676640616, change rate=0.05213627863954456, #trainingExamples=1587
Iteration #4 | average loss=0.4342257595710771, current cumulative loss=689.1162804392994, previous cumulative loss=713.9994548256865, change rate=0.03485041090467212, #trainingExamples=1587
Iteration #5 | average loss=0.4225120903723549, current cumulative loss=670.5266874209271, previous cumulative loss=689.1162804392994, change rate=0.026975988735198287, #trainingExamples=1587
Iteration #6 | average loss=0.41300825971798527, current cumulative loss=655.4441081724426, previous cumulative loss=670.5266874209271, change rate=0.022493630054453533, #trainingExamples=1587
Iteration #7 | average loss=0.40491514701335013, current cumulative loss=642.6003383101867, previous cumulative loss=655.4441081724426, change rate=0.019595522641995967, #trainingExamples=1587
Iteration #8 | average loss=0.3978014571916465, current cumulative loss=631.310912563143, previous cumulative loss=642.6003383101867, change rate=0.017568347033135524, #trainingExamples=1587
Iteration #9 | average loss=0.3914067263636397, current cumulative loss=621.1624747390962, previous cumulative loss=631.310912563143, change rate=0.016075182009517044, #trainingExamples=1587
Iteration #10 | average loss=0.3855609819906249, current cumulative loss=611.8852784191217, previous cumulative loss=621.1624747390962, change rate=0.014935216947661086, #trainingExamples=1587
Iteration #11 | average loss=0.3801467153362753, current cumulative loss=603.2928372386689, previous cumulative loss=611.8852784191217, change rate=0.01404256889894858, #trainingExamples=1587
Iteration #12 | average loss=0.3750791243746283, current cumulative loss=595.2505703825351, previous cumulative loss=603.2928372386689, change rate=0.01333061883005943, #trainingExamples=1587
Iteration #13 | average loss=0.37029474458756273, current cumulative loss=587.657759660462, previous cumulative loss=595.2505703825351, change rate=0.012755654676976761, #trainingExamples=1587
Iteration #14 | average loss=0.36574472099268607, current cumulative loss=580.4368722153928, previous cumulative loss=587.657759660462, change rate=0.012287572700888608, #trainingExamples=1587
Iteration #15 | average loss=0.3613904840032808, current cumulative loss=573.5266981132066, previous cumulative loss=580.4368722153928, change rate=0.011905126005885216, #trainingExamples=1587
Performed 15 iterations of 1,587 training examples on memory (thus 23,805 training updates in total)
```
> LogLoss: 0.4771035166468042
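For reference, the `change rate` reported in the log above is simply the relative drop in cumulative loss between consecutive iterations. A minimal sketch (function name and stopping threshold are illustrative, not Hivemall's actual API):

```python
def change_rate(prev_loss: float, cur_loss: float) -> float:
    """Relative improvement in cumulative loss between two iterations."""
    return (prev_loss - cur_loss) / prev_loss

# Values taken from iteration #2 of the training log above.
rate = change_rate(990.2550021169766, 753.2722676640616)

# An early-stopping rule could terminate training once the
# improvement falls below some small threshold (illustrative value).
converged = rate < 0.005
```

Since `#trainingExamples` is constant across iterations here, comparing cumulative losses is equivalent to comparing average losses.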
#### LIBFFM
```
$ ./ffm-train -k 4 -t 15 -l 0.00002 -r 0.2 -s 1 ../tr.sp model
First check if the text file has already been converted to binary format (0.0 seconds)
Binary file NOT found. Convert text file to binary file (0.0 seconds)
iter tr_logloss tr_time
1 0.62043 0.0
2 0.47533 0.1
3 0.44968 0.1
4 0.43548 0.2
5 0.42261 0.2
6 0.41322 0.3
7 0.40489 0.3
8 0.39687 0.4
9 0.39085 0.4
10 0.38530 0.4
11 0.37965 0.5
12 0.37450 0.5
13 0.36937 0.6
14 0.36444 0.6
15 0.36031 0.7
$ ./ffm-predict ../va.sp model submission.csv
logloss = 0.47818
```
### Without linear terms (i.e., adding `-disable_wi` option)
#### Hivemall
```
Iteration #2 | average loss=0.539961924393562, current cumulative loss=856.919574012583, previous cumulative loss=1651.6985545424677, change rate=0.48118888179934516, #trainingExamples=1587
Iteration #3 | average loss=0.5106114115327627, current cumulative loss=810.3403101024943, previous cumulative loss=856.919574012583, change rate=0.05435663430113771, #trainingExamples=1587
Iteration #4 | average loss=0.4906722901321148, current cumulative loss=778.6969244396662, previous cumulative loss=810.3403101024943, change rate=0.03904950212686045, #trainingExamples=1587
Iteration #5 | average loss=0.4754916462118607, current cumulative loss=754.6052425382229, previous cumulative loss=778.6969244396662, change rate=0.030938457755922362, #trainingExamples=1587
Iteration #6 | average loss=0.46330291728471334, current cumulative loss=735.2617297308401, previous cumulative loss=754.6052425382229, change rate=0.025633949669257704, #trainingExamples=1587
Iteration #7 | average loss=0.453140805287918, current cumulative loss=719.1344579919258, previous cumulative loss=735.2617297308401, change rate=0.021934055706691043, #trainingExamples=1587
Iteration #8 | average loss=0.44439540937886607, current cumulative loss=705.2555146842604, previous cumulative loss=719.1344579919258, change rate=0.019299510895946, #trainingExamples=1587
Iteration #9 | average loss=0.4366611986545602, current cumulative loss=692.9813222647871, previous cumulative loss=705.2555146842604, change rate=0.017403894282157387, #trainingExamples=1587
Iteration #11 | average loss=0.42321511843877446, current cumulative loss=671.6423929623351, previous cumulative loss=681.8770641514493, change rate=0.015009554840872389, #trainingExamples=1587
Iteration #12 | average loss=0.4171781468097722, current cumulative loss=662.0617189871085, previous cumulative loss=671.6423929623351, change rate=0.01426454624606136, #trainingExamples=1587
Iteration #13 | average loss=0.411451696404218, current cumulative loss=652.973842193494, previous cumulative loss=662.0617189871085, change rate=0.013726630815504848, #trainingExamples=1587
Iteration #14 | average loss=0.40595767772793845, current cumulative loss=644.2548345542383, previous cumulative loss=652.973842193494, change rate=0.013352767103145282, #trainingExamples=1587
Iteration #15 | average loss=0.4006353270154049, current cumulative loss=635.8082639734475, previous cumulative loss=644.2548345542383, change rate=0.013110604884532947, #trainingExamples=1587
Performed 15 iterations of 1,587 training examples on memory (thus 23,805 training updates in total)
```
> LogLoss: 0.4757278678816663
#### LIBFFM
```
$ ./ffm-train -k 4 -t 15 -l 0.00002 -r 0.2 -s 1 --disable-wi ../tr.sp model
First check if the text file has already been converted to binary format (0.0 seconds)
Binary file found. Skip converting text to binary
iter tr_logloss tr_time
1 1.03199 0.1
2 0.53894 0.1
3 0.51018 0.1
4 0.49096 0.2
5 0.47549 0.2
6 0.46334 0.3
7 0.45313 0.3
8 0.44405 0.3
9 0.43662 0.4
10 0.42985 0.4
11 0.42337 0.5
12 0.41732 0.5
13 0.41140 0.6
14 0.40583 0.6
15 0.40049 0.6
$ ./ffm-predict ../va.sp model submission.csv
logloss = 0.47284
```
FFM without linear terms performs slightly better in both Hivemall (LogLoss 0.4757 vs. 0.4771) and LIBFFM (0.47284 vs. 0.47818).
---
[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...
Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on the issue:
https://github.com/apache/incubator-hivemall/pull/149
I'll change the default options and consider implementing an early stopping option as you suggested.
> What happens without `-l2norm` ?
Once we drop instance-wise L2 normalization, the model easily overfits the training samples, and prediction accuracy degrades dramatically.
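For context, instance-wise L2 normalization (what `-l2norm` enables in Hivemall, and what LIBFFM applies unless `--no-norm` is given) rescales each instance's feature values to unit L2 norm. A rough sketch — the `(field, index, value)` representation and function name here are illustrative, not the actual Hivemall API:

```python
import math

def l2_normalize(features):
    """Scale feature values so the instance has unit L2 norm.

    `features` is a list of (field, index, value) triples in FFM style.
    """
    norm = math.sqrt(sum(v * v for _, _, v in features))
    if norm == 0.0:
        return features
    return [(f, i, v / norm) for f, i, v in features]

x = [("user", 1, 1.0), ("item", 42, 1.0), ("ctx", 7, 1.0)]
print(l2_normalize(x))  # each value becomes 1/sqrt(3), roughly 0.577
```

Keeping per-instance magnitudes bounded is what prevents the runaway training-loss collapse (and validation blow-up) shown in the `--no-norm` runs below.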
**LIBFFM**:
```
$ ./ffm-train -k 4 -t 15 -l 0.00002 -r 0.2 -s 1 --no-norm ../tr.sp model
First check if the text file has already been converted to binary format (0.0 seconds)
Binary file NOT found. Convert text file to binary file (0.0 seconds)
iter tr_logloss tr_time
1 4.24374 0.0
2 0.53960 0.1
3 0.09525 0.2
4 0.01288 0.2
5 0.00215 0.3
6 0.00133 0.3
7 0.00112 0.3
8 0.00098 0.4
9 0.00089 0.4
10 0.00082 0.5
11 0.00076 0.5
12 0.00072 0.6
13 0.00068 0.6
14 0.00064 0.6
15 0.00061 0.7
$ ./ffm-predict ../va.sp model submission.csv
logloss = 1.75623
```
**Hivemall**:
```
Iteration #2 | average loss=0.5186307939402891, current cumulative loss=823.0670699832388, previous cumulative loss=6640.3299608989755, change rate=0.876050275388452, #trainingExamples=1587
Iteration #3 | average loss=0.06870252595245425, current cumulative loss=109.0309086865449, previous cumulative loss=823.0670699832388, change rate=0.8675309550547743, #trainingExamples=1587
Iteration #4 | average loss=0.01701292407900819, current cumulative loss=26.999510513386, previous cumulative loss=109.0309086865449, change rate=0.7523682886014696, #trainingExamples=1587
Iteration #5 | average loss=0.003132377872105223, current cumulative loss=4.971083683030989, previous cumulative loss=26.999510513386, change rate=0.8158824516256917, #trainingExamples=1587
Iteration #6 | average loss=0.001693780516846469, current cumulative loss=2.6880296802353465, previous cumulative loss=4.971083683030989, change rate=0.4592668617888987, #trainingExamples=1587
Iteration #7 | average loss=0.0013357168592237345, current cumulative loss=2.1197826555880668, previous cumulative loss=2.6880296802353465, change rate=0.21139908864307172, #trainingExamples=1587
Iteration #8 | average loss=0.0011459013923848537, current cumulative loss=1.8185455097147627, previous cumulative loss=2.1197826555880668, change rate=0.1421075623386188, #trainingExamples=1587
Iteration #9 | average loss=0.001017751388111345, current cumulative loss=1.6151714529327046, previous cumulative loss=1.8185455097147627, change rate=0.11183336116452601, #trainingExamples=1587
Iteration #10 | average loss=9.230266490923267E-4, current cumulative loss=1.4648432921095225, previous cumulative loss=1.6151714529327046, change rate=0.0930725716766649, #trainingExamples=1587
Iteration #11 | average loss=8.493080071393429E-4, current cumulative loss=1.3478518073301373, previous cumulative loss=1.4648432921095225, change rate=0.07986621190783184, #trainingExamples=1587
Iteration #12 | average loss=7.898623710141035E-4, current cumulative loss=1.2535115827993821, previous cumulative loss=1.3478518073301373, change rate=0.0699930244687856, #trainingExamples=1587
Iteration #13 | average loss=7.406521210973545E-4, current cumulative loss=1.1754149161815017, previous cumulative loss=1.2535115827993821, change rate=0.06230230951952787, #trainingExamples=1587
Iteration #14 | average loss=6.990685420175246E-4, current cumulative loss=1.1094217761818115, previous cumulative loss=1.1754149161815017, change rate=0.056144548696113294, #trainingExamples=1587
Iteration #15 | average loss=6.633493164996776E-4, current cumulative loss=1.0527353652849885, previous cumulative loss=1.1094217761818115, change rate=0.051095455410939475, #trainingExamples=1587
Performed 15 iterations of 1,587 training examples on memory (thus 23,805 training updates in total)
```
> LogLoss: 1.8970086009757248
---
[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/149#discussion_r191298514
--- Diff: core/src/main/java/hivemall/fm/FactorizationMachineUDTF.java ---
@@ -379,23 +379,28 @@ protected void checkInputVector(@Nonnull final Feature[] x) throws HiveException
_model.check(x);
}
+ protected void processValidationSample(@Nonnull final Feature[] x, final double y)
+ throws HiveException {
+ if (_adaptiveRegularization) {
+ trainLambda(x, y); // adaptive regularization
--- End diff --
`FFM fully ignores adaptive regularization option` is the expected behavior.
Adaptive regularization has not been tested with FFM and/or FTRL.
---
[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:
https://github.com/apache/incubator-hivemall/pull/149
Also, it's better to revise default `-iters` from 1 to 10 (at least 10 iterations with early stopping).
---
[GitHub] incubator-hivemall pull request #149: [HIVEMALL-201] Evaluate, fix and docum...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/149#discussion_r200590772
--- Diff: core/src/main/java/hivemall/fm/FactorizationMachineUDTF.java ---
@@ -283,9 +293,16 @@ public void process(Object[] args) throws HiveException {
}
++_t;
- recordTrain(x, y);
- boolean adaptiveRegularization = (_va_rand != null) && _t >= _validationThreshold;
- train(x, y, adaptiveRegularization);
+
+ boolean validation = false;
+ if ((_va_rand != null) && _t >= _validationThreshold) {
+ final float rnd = _va_rand.nextFloat();
+ validation = rnd < _validationRatio;
+ }
+
+ recordTrain(x, y, validation);
+
+ train(x, y, validation);
--- End diff --
Validation examples are fixed in this implementation. Also, not using non-validation examples for regularization is a bad strategy.
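A rough Python restatement of the sampling logic in the diff above (names and defaults are illustrative, not Hivemall's): each streamed record past a warm-up threshold is tagged as a validation example with fixed probability, so the held-out set is decided once per record rather than resampled each iteration:

```python
import random

def tag_validation(examples, validation_ratio=0.05, threshold=100, seed=31):
    """Tag each streamed example as a validation sample with fixed probability.

    Mirrors the diffed logic: after `threshold` records have been seen,
    each record is held out with probability `validation_ratio`.
    """
    rng = random.Random(seed)
    t = 0
    for x in examples:
        t += 1
        is_validation = t >= threshold and rng.random() < validation_ratio
        yield x, is_validation
```

Because the tag is assigned when the record is first recorded, the same records remain the validation set for every subsequent iteration — which is what "validation examples are fixed" refers to.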
---
[GitHub] incubator-hivemall pull request #149: [HIVEMALL-201] Evaluate, fix and docum...
Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/149#discussion_r211470802
--- Diff: core/src/main/java/hivemall/fm/FieldAwareFactorizationMachineModel.java ---
@@ -123,17 +117,18 @@ void updateWi(final double dloss, @Nonnull final Feature x, final long t) {
}
final double Xi = x.getValue();
- float gradWi = (float) (dloss * Xi);
final Entry theta = getEntryW(x);
float wi = theta.getW();
- final float eta = eta(theta, t, gradWi);
- float nextWi = wi - eta * (gradWi + 2.f * _lambdaW * wi);
+ float grad = (float) (dloss * Xi + 2.f * _lambdaW * wi);
--- End diff --
regularization should not be performed here (?)
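To illustrate the question being raised: with a constant learning rate, the two placements of the L2 penalty are algebraically identical, but when the gradient also feeds an adaptive-learning-rate accumulator (e.g. AdaGrad's sum of squared gradients), folding the penalty into `grad` changes what gets accumulated. A plain-SGD sketch with symbols assumed from the diff:

```python
def update_penalty_in_step(w, dloss, xi, eta, lam):
    """L2 penalty applied in the update step, outside the raw gradient."""
    grad = dloss * xi
    return w - eta * (grad + 2.0 * lam * w)

def update_penalty_in_grad(w, dloss, xi, eta, lam):
    """L2 penalty folded into the gradient itself (as in the diff)."""
    grad = dloss * xi + 2.0 * lam * w
    return w - eta * grad
```

For plain SGD both return the same weight; the difference only appears when `grad` is squared and accumulated to scale `eta`, which is presumably the concern behind the review comment.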
---