Posted to issues@hivemall.apache.org by takuti <gi...@git.apache.org> on 2018/05/17 05:41:30 UTC

[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...

GitHub user takuti opened a pull request:

    https://github.com/apache/incubator-hivemall/pull/149

    [WIP][HIVEMALL-201] Evaluate, fix and document FFM

    ## What changes were proposed in this pull request?
    
    - Evaluate FFM so that Hivemall achieves accuracy comparable to [LIBFFM](https://github.com/guestwalk/libffm).
    - Fix its implementation if needed
    - Document how to use FFM
    
    ## What type of PR is it?
    
    Bug Fix, Improvement, Documentation
    
    ## What is the Jira issue?
    
    https://issues.apache.org/jira/browse/HIVEMALL-201
    
    ## How was this patch tested?
    
    Unit tests and manual tests
    
    ## How to use this feature?
    
    (To be documented)
    
    ## Checklist
    
    - [x] Did you apply source code formatter, i.e., `mvn formatter:format`, for your commit?
    - [ ] Did you run system tests on Hive (or Spark)?


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/takuti/incubator-hivemall HIVEMALL-201

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-hivemall/pull/149.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #149
    
----
commit c67744cbe60711a9cb9da5c55a4157f9d107dbf3
Author: Takuya Kitazawa <k....@...>
Date:   2018-05-16T08:39:32Z

    Use pre-defined constants in option description

commit d77f0161220327a4e2ef12f368f329dc56c1c941
Author: Takuya Kitazawa <k....@...>
Date:   2018-05-16T08:40:48Z

    Fix mismatch between opts.addOption and cl.getOptionValue

commit c8e374b8bfb87a0ed420604aca2340c974770f50
Author: Takuya Kitazawa <k....@...>
Date:   2018-05-16T08:41:34Z

    Support FFM feature format in `l1_normalize` and `l2_normalize`

----


---

[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/149
  
    @takuti so then, better to enable l2_norm by default and add `-disable_l2norm` to disable L2 normalization. My concern is that L2 normalization performed worse for small datasets with an adequate learning rate `[0.1,1.0]`. 
    
    FieldAwareFactorizationMachineUDTFTest contains several tests. It's better to confirm that accuracy does not degrade with the new default options, i.e., with L2 normalization enabled.
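    
    As a concrete reference for what instance-wise L2 normalization does to an FFM record, here is a minimal sketch (a hypothetical helper, not Hivemall's `l2_normalize` UDF) that scales each `field:index:value` feature of one instance by the inverse of the instance's L2 norm:
    
    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    
    // Hypothetical sketch of instance-wise L2 normalization over FFM-formatted
    // features ("field:index:value"); not the actual Hivemall implementation.
    public final class L2NormalizeSketch {
    
        static List<String> l2normalize(final List<String> features) {
            double squaredSum = 0.d;
            final double[] values = new double[features.size()];
            for (int i = 0; i < features.size(); i++) {
                // assume the FFM format "field:index:value"
                String[] parts = features.get(i).split(":");
                double v = Double.parseDouble(parts[2]);
                values[i] = v;
                squaredSum += v * v;
            }
            final double norm = Math.sqrt(squaredSum);
            final List<String> result = new ArrayList<>(features.size());
            for (int i = 0; i < features.size(); i++) {
                String[] parts = features.get(i).split(":");
                // scale each value by 1/||x||_2 so the instance has unit L2 norm
                result.add(parts[0] + ":" + parts[1] + ":" + (values[i] / norm));
            }
            return result;
        }
    
        public static void main(String[] args) {
            // e.g. [0:1:-0.894..., 1:2:-0.447...]
            System.out.println(l2normalize(Arrays.asList("0:1:-2", "1:2:-1")));
        }
    }
    ```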


---

[GitHub] incubator-hivemall pull request #149: [HIVEMALL-201] Evaluate, fix and docum...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/149#discussion_r205390805
  
    --- Diff: core/src/main/java/hivemall/fm/FactorizationMachineModel.java ---
    @@ -399,9 +399,8 @@ public void initRandom(int factor, long seed) {
         protected static final void uniformFill(final float[] a, final Random rand,
                 final float maxInitValue) {
             final int len = a.length;
    -        final float basev = maxInitValue / len;
             for (int i = 0; i < len; i++) {
    -            float v = rand.nextFloat() * basev;
    +            float v = rand.nextFloat() * maxInitValue;
    --- End diff --
    
    While this modified `random` initialization is used only for regression (not for classification), your evaluation covers only classification. 
    
    Thus, it's doubtful that this change contributed to improving accuracy.


---

[GitHub] incubator-hivemall issue #149: [HIVEMALL-201] Evaluate, fix and document FFM

Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/149
  
    We need to keep the default hyperparameters of FM as they are for backward compatibility. I'll take care of it on merging.


---

[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...

Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/149#discussion_r191317118
  
    --- Diff: core/src/main/java/hivemall/fm/FactorizationMachineModel.java ---
    @@ -92,6 +92,14 @@ protected float getW(int i) {
     
         protected abstract void setW(@Nonnull Feature x, float nextWi);
     
    +    protected void setW(int i, float nextWi) {
    --- End diff --
    
    ad24f38103b3951c6266382c048e0f4514dacd1e


---

[GitHub] incubator-hivemall pull request #149: [HIVEMALL-201] Evaluate, fix and docum...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/149#discussion_r200604967
  
    --- Diff: core/src/main/java/hivemall/fm/FactorizationMachineUDTF.java ---
    @@ -351,19 +370,29 @@ private static void writeBuffer(@Nonnull ByteBuffer srcBuf, @Nonnull NioStateful
             srcBuf.clear();
         }
     
    -    public void train(@Nonnull final Feature[] x, final double y,
    -            final boolean adaptiveRegularization) throws HiveException {
    +    protected void checkInputVector(@Nonnull final Feature[] x) throws HiveException {
             _model.check(x);
    +    }
    +
    +    protected void processValidationSample(@Nonnull final Feature[] x, final double y)
    +            throws HiveException {
    +        if (_adaptiveRegularization) {
    +            trainLambda(x, y); // adaptive regularization
    +        }
    +        if (_earlyStopping) {
    --- End diff --
    
    Early stopping is better performed before adaptive regularization.


---

[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/149#discussion_r191645232
  
    --- Diff: core/src/test/java/hivemall/fm/FieldAwareFactorizationMachineUDTFTest.java ---
    @@ -256,6 +256,19 @@ public void testEarlyStopping() throws HiveException, IOException {
                 cumulativeLoss > udtf._validationState.getCumulativeLoss());
         }
     
    +    @Test(expected = IllegalArgumentException.class)
    +    public void testUnsupportedAdaptiveRegularizationOption() throws Exception {
    +        TestUtils.testGenericUDTFSerialization(FieldAwareFactorizationMachineUDTF.class,
    +            new ObjectInspector[] {
    +                    ObjectInspectorFactory.getStandardListObjectInspector(
    +                        PrimitiveObjectInspectorFactory.javaStringObjectInspector),
    +                    PrimitiveObjectInspectorFactory.javaDoubleObjectInspector,
    +                    ObjectInspectorUtils.getConstantObjectInspector(
    +                        PrimitiveObjectInspectorFactory.javaStringObjectInspector,
    +                        "-seed 43 -adaptive_regularization")},
    +            new Object[][] {{Arrays.asList("0:1:-2", "1:2:-1"), 1.0}});
    --- End diff --
    
    Better to compare accuracy against the default regularization. In general, it should be better than the default one.


---

[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/149
  
    ```
        iter   tr_logloss   va_logloss
           1      0.49738      0.48776
           2      0.47383      0.47995
           3      0.46366      0.47480
           4      0.45561      0.47231
           5      0.44810      0.47034
           6      0.44037      0.47003
           7      0.43239      0.46952
           8      0.42362      0.46999 <- ffm stops once va_logloss increases, but va_logloss might decrease in the next iteration
           9      0.41394      0.47088
    ```
    
    In the 8th iteration, it becomes `ready to stop once va_logloss increases`. 
    If va_logloss decreases in the 9th iteration, then continue iterating (set not ready to finish).
    If va_logloss increases in the 9th iteration, then emit the current model in the 9th iteration.
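    
    To make this concrete, here is a minimal sketch (hypothetical names, not the actual `_validationState` implementation) of the two-strike rule described above, where a single increase of the validation loss only arms the stop and a second consecutive increase actually stops training:
    
    ```java
    // Hypothetical sketch of the suggested early stopping rule; assumed names.
    final class EarlyStoppingSketch {
        private double previousLoss = Double.POSITIVE_INFINITY;
        private boolean readyToStop = false;
    
        /** @return true if training should stop after this iteration */
        boolean update(final double validationLoss) {
            final boolean increased = validationLoss > previousLoss;
            previousLoss = validationLoss;
            if (!increased) {
                readyToStop = false; // loss recovered; keep iterating
                return false;
            }
            if (readyToStop) {
                return true; // second consecutive increase; emit the current model
            }
            readyToStop = true; // first increase; arm but continue
            return false;
        }
    }
    ```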


---

[GitHub] incubator-hivemall issue #149: [HIVEMALL-201] Evaluate, fix and document FFM

Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/149
  
    @takuti with the modified default hyperparameters of FM, the performance of FM gets worse.
    
    Before 
    > 0.6736798239047873 (mae) 0.858938110314545 (rmse)
    
    After
    > 0.6837803085633278 (mae) 0.876690912076831 (rmse)
    
    http://hivemall.incubator.apache.org/userguide/recommend/movielens_fm.html


---

[GitHub] incubator-hivemall pull request #149: [HIVEMALL-201] Evaluate, fix and docum...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/149#discussion_r200611442
  
    --- Diff: core/src/main/java/hivemall/fm/FieldAwareFactorizationMachineModel.java ---
    @@ -51,11 +50,6 @@
         public FieldAwareFactorizationMachineModel(@Nonnull FFMHyperParameters params) {
             super(params);
             this._params = params;
    -        if (params.useAdaGrad) {
    -            this._eta0 = 1.0f;
    --- End diff --
    
    Better to use a large default eta0 for AdaGrad.


---

[GitHub] incubator-hivemall pull request #149: [HIVEMALL-201] Evaluate, fix and docum...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-hivemall/pull/149


---

[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...

Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/149#discussion_r191315508
  
    --- Diff: core/src/main/java/hivemall/fm/FactorizationMachineUDTF.java ---
    @@ -379,23 +379,28 @@ protected void checkInputVector(@Nonnull final Feature[] x) throws HiveException
             _model.check(x);
         }
     
    +    protected void processValidationSample(@Nonnull final Feature[] x, final double y)
    +            throws HiveException {
    +        if (_adaptiveRegularization) {
    +            trainLambda(x, y); // adaptive regularization
    --- End diff --
    
    373144d5151d1a57c3c13059ac70cc2e2908dc44



---

[GitHub] incubator-hivemall issue #149: [HIVEMALL-201] Evaluate, fix and document FFM

Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/149
  
    @myui Documentation and system tests on my laptop/EMR have been completed, and I'm now ready for review. Could you take a closer look at the updates for merging?


---

[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/149#discussion_r191309443
  
    --- Diff: core/src/main/java/hivemall/fm/FactorizationMachineModel.java ---
    @@ -92,6 +92,14 @@ protected float getW(int i) {
     
         protected abstract void setW(@Nonnull Feature x, float nextWi);
     
    +    protected void setW(int i, float nextWi) {
    --- End diff --
    
    No need to have `protected void setW(int i, float nextWi)` and `protected void setW(@Nonnull String j, float nextWi)` in FactorizationMachineModel.
    



---

[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...

Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/149#discussion_r191694422
  
    --- Diff: core/src/test/java/hivemall/fm/FieldAwareFactorizationMachineUDTFTest.java ---
    @@ -256,6 +256,19 @@ public void testEarlyStopping() throws HiveException, IOException {
                 cumulativeLoss > udtf._validationState.getCumulativeLoss());
         }
     
    +    @Test(expected = IllegalArgumentException.class)
    +    public void testUnsupportedAdaptiveRegularizationOption() throws Exception {
    +        TestUtils.testGenericUDTFSerialization(FieldAwareFactorizationMachineUDTF.class,
    +            new ObjectInspector[] {
    +                    ObjectInspectorFactory.getStandardListObjectInspector(
    +                        PrimitiveObjectInspectorFactory.javaStringObjectInspector),
    +                    PrimitiveObjectInspectorFactory.javaDoubleObjectInspector,
    +                    ObjectInspectorUtils.getConstantObjectInspector(
    +                        PrimitiveObjectInspectorFactory.javaStringObjectInspector,
    +                        "-seed 43 -adaptive_regularization")},
    +            new Object[][] {{Arrays.asList("0:1:-2", "1:2:-1"), 1.0}});
    --- End diff --
    
    690a20032d6810a05e0b0ebb73e284b8d6ea7cb7


---

[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/149#discussion_r190842171
  
    --- Diff: core/src/main/java/hivemall/fm/FactorizationMachineUDTF.java ---
    @@ -563,6 +580,10 @@ protected void runTrainingIteration(int iterations) throws HiveException {
                     inputBuf.flip();
     
                     for (int iter = 2; iter <= iterations; iter++) {
    +                    if (earlyStopValidation) {
    --- End diff --
    
    Better to avoid many `if (earlyStopValidation) {` checks.
    
    `_validateState` can always be non-null as long as `earlyStopValidation && _validateState.isLossIncreased()` never becomes true.


---

[GitHub] incubator-hivemall issue #149: [HIVEMALL-201] Evaluate, fix and document FFM

Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/149
  
    @takuti Sure.


---

[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/149
  
    @takuti I advise checking 2-3 updates to investigate how the gradient updates differ.


---

[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/149
  
    This kind of behavior can often happen, and Libffm's early stopping strategy is too aggressive.
    
    ```
           7      0.43239      0.46952
           8      0.42362      0.46999
           9      0.41394      0.45088 
    ```


---

[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/149#discussion_r190843344
  
    --- Diff: core/src/main/java/hivemall/fm/FactorizationMachineUDTF.java ---
    @@ -352,9 +352,13 @@ private static void writeBuffer(@Nonnull ByteBuffer srcBuf, @Nonnull NioStateful
             srcBuf.clear();
         }
     
    +    protected void checkInputVector(@Nonnull final Feature[] x) throws HiveException {
    +        _model.check(x);
    +    }
    +
         public void train(@Nonnull final Feature[] x, final double y,
                 final boolean adaptiveRegularization) throws HiveException {
    -        _model.check(x);
    +        checkInputVector(x);
     
             try {
                 if (adaptiveRegularization) {
    --- End diff --
    
    I think there is no need to share `train` if `adaptiveRegularization` is always off for FFM and `early_stopping` is always off for FM. The logic in `train` becomes complex.


---

[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/149
  
    @takuti Thank you for the detailed verification. Let's disable the linear term by default.
    
    Remove `-disable_wi` and use `-enable_wi` (alias `-linear_term`) to enable the linear term.
    
    I'm not sure `-l2norm` should be enabled by default. What happens without `-l2norm`?


---

[GitHub] incubator-hivemall pull request #149: [HIVEMALL-201] Evaluate, fix and docum...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/149#discussion_r200605210
  
    --- Diff: core/src/main/java/hivemall/fm/FactorizationMachineUDTF.java ---
    @@ -351,19 +370,29 @@ private static void writeBuffer(@Nonnull ByteBuffer srcBuf, @Nonnull NioStateful
             srcBuf.clear();
         }
     
    -    public void train(@Nonnull final Feature[] x, final double y,
    -            final boolean adaptiveRegularization) throws HiveException {
    +    protected void checkInputVector(@Nonnull final Feature[] x) throws HiveException {
             _model.check(x);
    +    }
    +
    +    protected void processValidationSample(@Nonnull final Feature[] x, final double y)
    +            throws HiveException {
    +        if (_adaptiveRegularization) {
    +            trainLambda(x, y); // adaptive regularization
    +        }
    +        if (_earlyStopping) {
    +            double p = _model.predict(x);
    +            double loss = _lossFunction.loss(p, y);
    +            _validationState.incrLoss(loss);
    +        }
    +    }
    +
    +    public void train(@Nonnull final Feature[] x, final double y, final boolean validation)
    +            throws HiveException {
    +        checkInputVector(x);
    --- End diff --
    
    Avoid too many virtual method calls. 
    
    `_model.check(x);` is enough for both FM and FFM.


---

[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/149#discussion_r191309836
  
    --- Diff: core/src/main/java/hivemall/fm/FactorizationMachineModel.java ---
    @@ -92,6 +92,14 @@ protected float getW(int i) {
     
         protected abstract void setW(@Nonnull Feature x, float nextWi);
     
    +    protected void setW(int i, float nextWi) {
    --- End diff --
    
    `setW(int i, float nextWi)` is no longer used once we avoid caching in early stopping.


---

[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...

Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/149
  
    Note: I've extended the LIBFFM code so that it uses linear terms: https://github.com/takuti/criteo-ffm/commit/9aca61d93ed8f583025729206ed0dbfd54806a44 However, I cannot observe a significant difference in LogLoss with vs. without linear terms.


---

[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/149
  
    BTW, it might be better to implement `early stopping` using validation data.
    https://github.com/guestwalk/libffm
    
    We can use an approach similar to the `_validationRatio` used in `FactorizationMachineUDTF` instead of preparing a separate validation dataset.
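    
    For reference, a minimal sketch of that idea (hypothetical field names, not the actual `FactorizationMachineUDTF` code): each incoming row is randomly held out for validation with a small probability, so no separate validation table is needed.
    
    ```java
    import java.util.Random;
    
    // Hypothetical sketch of an on-the-fly validation split in the spirit of
    // _validationRatio; the field names and the 5% ratio are assumptions.
    final class ValidationSplitSketch {
        private final Random rnd = new Random(31L);
        private final float validationRatio = 0.05f; // hold out ~5% of rows
    
        /** Decide per row whether it is used for validation (loss tracking) or training. */
        boolean isValidationSample() {
            return rnd.nextFloat() < validationRatio;
        }
    }
    ```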


---

[GitHub] incubator-hivemall pull request #149: [HIVEMALL-201] Evaluate, fix and docum...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/149#discussion_r200093417
  
    --- Diff: core/src/main/java/hivemall/fm/FieldAwareFactorizationMachineModel.java ---
    @@ -259,9 +255,9 @@ protected final float eta(@Nonnull final Entry theta, final long t, final float
         protected final float eta(@Nonnull final Entry theta, @Nonnegative final int f, final long t,
                 final float grad) {
             if (_useAdaGrad) {
    -            double gg = theta.getSumOfSquaredGradients(f);
    --- End diff --
    
    @takuti This behavior (the one used in libffm) is wrong in a strict sense, and the previous code is much better, because the initial eta should equal `eta0`, whereas this implementation depends on the initial gradient. 
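    
    To illustrate the difference, here is a simplified sketch (hypothetical, single accumulator, assuming an `eta0 / sqrt(1 + gg)` form; not the actual Hivemall code) of the two AdaGrad learning-rate computations. Only one variant would be used in practice:
    
    ```java
    // Hypothetical sketch contrasting when the squared-gradient accumulator is updated.
    final class AdaGradEtaSketch {
        private final float eta0;
        private double sumOfSquaredGradients = 0.d;
    
        AdaGradEtaSketch(float eta0) {
            this.eta0 = eta0;
        }
    
        /** Previous behavior: accumulate AFTER computing eta, so the first eta equals eta0. */
        float etaPrevious(final float grad) {
            float eta = (float) (eta0 / Math.sqrt(1.d + sumOfSquaredGradients));
            sumOfSquaredGradients += grad * grad;
            return eta;
        }
    
        /** libffm-style behavior: accumulate BEFORE computing eta, so the first eta depends on the first gradient. */
        float etaLibffmStyle(final float grad) {
            sumOfSquaredGradients += grad * grad;
            return (float) (eta0 / Math.sqrt(1.d + sumOfSquaredGradients));
        }
    }
    ```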


---

[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/149
  
    @takuti 
    
    The linear term is not used in the Libffm implementation. Better to research other FFM implementations as well.
    https://github.com/chenhuang-learn/ffm/blob/master/ffm/src/ffm/FFMModel.java
    https://github.com/superclocks/ffm/blob/master/libffm-ftrl-1.13/ffm-train.cpp
    https://github.com/chenhuang-learn/ffm
    https://github.com/gaterslebenchen/JLibFFM/
    https://github.com/yuantiku/ytk-learn
    https://github.com/RTBHOUSE/cuda-ffm/ (modified version of FFM)
    
    The default initial learning rate also has a large effect on convergence.
    
    For instance-wise normalization, better to follow the discussion in 
    https://markmail.org/message/jwtr5xygfutl55oz 
    It performs well only when all features are categorical. For his dataset, instance-wise L2 normalization performed very badly...
    
    https://gist.github.com/myui/aaeef548a17eb90c4e88f824c3ca1bcd


---

[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/149
  
    It might be better to reconsider `eta0` when enabling `l2norm` by default and enlarging `max_init_size`. In my experience with FM, the initial random size should be small when the average feature dimension is large (gradients will be large).
    
    I think `1.0` is too aggressive for the default though. `0.2` or `0.5`? Better to research other implementations.


---

[GitHub] incubator-hivemall issue #149: [HIVEMALL-201] Evaluate, fix and document FFM

Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/149
  
    revising in https://github.com/myui/incubator-hivemall/commits/HIVEMALL-201-2


---

[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/149#discussion_r191308798
  
    --- Diff: core/src/main/java/hivemall/fm/FMArrayModel.java ---
    @@ -80,6 +80,11 @@ public float getW(@Nonnull final Feature x) {
         @Override
         protected void setW(@Nonnull Feature x, float nextWi) {
             int i = x.getFeatureIndex();
    +        setW(i, nextWi);
    --- End diff --
    
    better to avoid method call.


---

[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...

Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/149
  
    Makes sense as a compromise in terms of memory consumption. 
    
    I'll note in the documentation that our `-early_stopping` option does not return the very best model; users may expect the option to return the best model achieved at the 7th iteration, but our UDF does not behave that way, as discussed above.


---

[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...

Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/149#discussion_r191145436
  
    --- Diff: core/src/main/java/hivemall/fm/FactorizationMachineUDTF.java ---
    @@ -563,6 +580,10 @@ protected void runTrainingIteration(int iterations) throws HiveException {
                     inputBuf.flip();
     
                     for (int iter = 2; iter <= iterations; iter++) {
    +                    if (earlyStopValidation) {
    --- End diff --
    
    Checking whether `_validateState` is `NULL` allows us to confirm that at least one validation sample exists && the early stopping option is enabled. Thanks to this implementation, we can avoid unnecessary calls to `cacheCurrentModel`, which consumes extra memory to hold the best model parameters.


---

[GitHub] incubator-hivemall issue #149: [HIVEMALL-201] Evaluate, fix and document FFM

Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/149
  
    > -lambda 0.0001 (default), -init_v adjusted_random
    0.6756640217829124      0.8644404496920104
    
    > -lambda 0.001, -init_v adjusted_random
    0.6749224090640931      0.8642914100412997
    
    > -lambda 0.002, -init_v adjusted_random
    0.6729486759257253      0.862249033512779
    
    > -lambda 0.01, -init_v adjusted_random
    0.6728088660666263      0.8568219312625348
    
    • libfm
    ```
    eta=0.1
    init_stdev=0.1
    reg0 = 0.0;
    regw = 0.0;
    regv = 0.0;
    ```
    
    https://github.com/srendle/libfm/blob/4ba0e0d5646da5d00701d853d19fbbe9b236cfd7/src/libfm/libfm.cpp#L87
    https://github.com/srendle/libfm/blob/30b9c799c41d043f31565cbf827bf41d0dc3e2ab/src/fm_core/fm_model.h#L73
    
    • libffm
    ```
    eta = 0.1; // learning rate
    lambda = 0.00002; // regularization parameter
    nr_iters = 15;
    k = 4; // number of latent factors
    ```
    
    https://github.com/srendle/libfm/blob/4ba0e0d5646da5d00701d853d19fbbe9b236cfd7/src/libfm/libfm.cpp#L84


---

[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...

Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/149#discussion_r191149655
  
    --- Diff: core/src/main/java/hivemall/fm/FactorizationMachineUDTF.java ---
    @@ -352,9 +352,13 @@ private static void writeBuffer(@Nonnull ByteBuffer srcBuf, @Nonnull NioStateful
             srcBuf.clear();
         }
     
    +    protected void checkInputVector(@Nonnull final Feature[] x) throws HiveException {
    +        _model.check(x);
    +    }
    +
         public void train(@Nonnull final Feature[] x, final double y,
                 final boolean adaptiveRegularization) throws HiveException {
    -        _model.check(x);
    +        checkInputVector(x);
     
             try {
                 if (adaptiveRegularization) {
    --- End diff --
    
    Well...the difference is only 3 lines:
    
    ```java
                    if (_adaptiveRegularization) {
                        trainLambda(x, y); // adaptive regularization
                    }
    ```
    
    Since FFM explicitly inherits the FM code and shares many options, I just tried to remove duplicated code between them as much as possible. Both FM and FFM use some training samples for validation in the same manner; the code should clearly reflect that fact.
    
    An alternative idea is this: 5fbcc017e9557a10275811888437afdf6c4a0ad7


---

[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...

Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/149
  
    Evaluation has been conducted at: [takuti/criteo-ffm](https://github.com/takuti/criteo-ffm). See the repository for details.
    
    As an example, I have used the tiny data provided at [guestwalk/kaggle-2014-criteo](https://github.com/guestwalk/kaggle-2014-criteo), which is already preprocessed and converted into the LIBFFM format:
    
    - Split 2,000 samples in `train.tiny.csv` to:
      - 1,587 training samples `tr.sp`
      - 412 validation samples `va.sp`
    
    As a result, the FFM models created by LIBFFM and Hivemall with the following (nearly identical) configurations showed very similar training loss and accuracy.
    
    **LIBFFM**:
    
    ```
    $ ./ffm-train -k 4 -t 15 -l 0.00002 -r 0.2 -s 10 ../tr.sp model
    iter   tr_logloss      tr_time
       1      1.04980          0.0
       2      0.53771          0.0
       3      0.50963          0.0
       4      0.48980          0.1
       5      0.47469          0.1
       6      0.46304          0.1
       7      0.45289          0.1
       8      0.44400          0.1
       9      0.43653          0.1
      10      0.42947          0.1
      11      0.42330          0.1
      12      0.41727          0.1
      13      0.41130          0.1
      14      0.40558          0.1
      15      0.40036          0.1
    ```
    
    > LogLoss on the validation set `va.sp`: 0.47237
    
    **Hivemall**:
    
    ```
    $ hive --hiveconf hive.root.logger=INFO,console
    hive> INSERT OVERWRITE TABLE criteo.ffm_model
        > SELECT
        >   train_ffm(features, label, '-init_v random -max_init_value 1.0 -classification -iterations 15 -factors 4 -eta 0.2 -l2norm -optimizer sgd -lambda 0.00002 -cv_rate 0.0 -disable_wi')
        > FROM (
        >   SELECT
        >     features, label
        >   FROM
        >     criteo.train_vectorized
        >   CLUSTER BY rand(1)
        > ) t
        > ;
    Record training examples to a file: /var/folders/rg/6mhvj7h567x_ys7brmf2bb6w0000gn/T/hivemall_fm6211397472147242886.sgmt
    Iteration #2 | average loss=0.5316043797079182, current cumulative loss=843.6561505964662, previous cumulative loss=1214.5909560888044, change rate=0.30539895232450376, #trainingExamples=1587
    Iteration #3 | average loss=0.5065999656968238, current cumulative loss=803.9741455608594, previous cumulative loss=843.6561505964662, change rate=0.04703575622313853, #trainingExamples=1587
    Iteration #4 | average loss=0.49634490612175397, current cumulative loss=787.6993660152235, previous cumulative loss=803.9741455608594, change rate=0.0202429140731664, #trainingExamples=1587
    Iteration #5 | average loss=0.48804954980765963, current cumulative loss=774.5346355447558, previous cumulative loss=787.6993660152235, change rate=0.0167128869698916, #trainingExamples=1587
    Iteration #6 | average loss=0.48072518575956447, current cumulative loss=762.9108698004288, previous cumulative loss=774.5346355447558, change rate=0.015007418920848658, #trainingExamples=1587
    Iteration #7 | average loss=0.47402279755334875, current cumulative loss=752.2741797171644, previous cumulative loss=762.9108698004288, change rate=0.013942244768444403, #trainingExamples=1587
    Iteration #8 | average loss=0.4677507471836629, current cumulative loss=742.320435780473, previous cumulative loss=752.2741797171644, change rate=0.013231537390308698, #trainingExamples=1587
    Iteration #9 | average loss=0.4618142861358177, current cumulative loss=732.8992720975427, previous cumulative loss=742.320435780473, change rate=0.012691505216375798, #trainingExamples=1587
    Iteration #10 | average loss=0.4561878517855827, current cumulative loss=723.9701207837197, previous cumulative loss=732.8992720975427, change rate=0.012183326759580433, #trainingExamples=1587
    Iteration #11 | average loss=0.45087834343992406, current cumulative loss=715.5439310391595, previous cumulative loss=723.9701207837197, change rate=0.01163886395675921, #trainingExamples=1587
    Iteration #12 | average loss=0.4458864402438874, current cumulative loss=707.6217806670493, previous cumulative loss=715.5439310391595, change rate=0.011071508021324606, #trainingExamples=1587
    Iteration #13 | average loss=0.44118468270053807, current cumulative loss=700.1600914457539, previous cumulative loss=707.6217806670493, change rate=0.010544742156271002, #trainingExamples=1587
    Iteration #14 | average loss=0.4367191822212713, current cumulative loss=693.0733421851576, previous cumulative loss=700.1600914457539, change rate=0.01012161268141256, #trainingExamples=1587
    Iteration #15 | average loss=0.4324248854220929, current cumulative loss=686.2582931648615, previous cumulative loss=693.0733421851576, change rate=0.009833084906727563, #trainingExamples=1587
    Performed 15 iterations of 1,587 training examples on memory (thus 23,805 training updates in total)
    ```
    
    > LogLoss on the same validation set: 0.47604112308042346
    
    Note that, since we used the `-l2norm` option for training, the validation samples should also be L2 normalized, e.g., `feature_pairs(l2_normalize(t1.features), '-ffm')`
    
    While the choice of hyperparameters and optimizer (SGD/FTRL/AdaGrad) affects the accuracy to some degree, I have noticed that `-disable_wi` can be a more important factor on this data. If we use the linear terms to train the FFM model, LogLoss on `va.sp` increases significantly to `1.5227099483928919`.
    
    I'm still not sure whether the result is natural or caused by a bug. Let me double-check the implementation.


---

[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...

Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/149
  
    ### With linear terms
    
    #### Hivemall
    
    ```sql
    INSERT OVERWRITE TABLE criteo.ffm_model
    SELECT
      train_ffm(features, label, '-init_v random -max_init_value 0.5 -classification -iterations 15 -factors 4 -eta 0.2 -l2norm -optimizer adagrad -lambda 0.00002 -cv_rate 0.0')
    FROM (
      SELECT
        features, label
      FROM
        criteo.train_vectorized
      CLUSTER BY rand(1)
    ) t
    ;
    ```
    
    ```
    Iteration #2 | average loss=0.474651712453725, current cumulative loss=753.2722676640616, previous cumulative loss=990.2550021169766, change rate=0.23931485722999737, #trainingExamples=1587
    Iteration #3 | average loss=0.4499051385165006, current cumulative loss=713.9994548256865, previous cumulative loss=753.2722676640616, change rate=0.05213627863954456, #trainingExamples=1587
    Iteration #4 | average loss=0.4342257595710771, current cumulative loss=689.1162804392994, previous cumulative loss=713.9994548256865, change rate=0.03485041090467212, #trainingExamples=1587
    Iteration #5 | average loss=0.4225120903723549, current cumulative loss=670.5266874209271, previous cumulative loss=689.1162804392994, change rate=0.026975988735198287, #trainingExamples=1587
    Iteration #6 | average loss=0.41300825971798527, current cumulative loss=655.4441081724426, previous cumulative loss=670.5266874209271, change rate=0.022493630054453533, #trainingExamples=1587
    Iteration #7 | average loss=0.40491514701335013, current cumulative loss=642.6003383101867, previous cumulative loss=655.4441081724426, change rate=0.019595522641995967, #trainingExamples=1587
    Iteration #8 | average loss=0.3978014571916465, current cumulative loss=631.310912563143, previous cumulative loss=642.6003383101867, change rate=0.017568347033135524, #trainingExamples=1587
    Iteration #9 | average loss=0.3914067263636397, current cumulative loss=621.1624747390962, previous cumulative loss=631.310912563143, change rate=0.016075182009517044, #trainingExamples=1587
    Iteration #10 | average loss=0.3855609819906249, current cumulative loss=611.8852784191217, previous cumulative loss=621.1624747390962, change rate=0.014935216947661086, #trainingExamples=1587
    Iteration #11 | average loss=0.3801467153362753, current cumulative loss=603.2928372386689, previous cumulative loss=611.8852784191217, change rate=0.01404256889894858, #trainingExamples=1587
    Iteration #12 | average loss=0.3750791243746283, current cumulative loss=595.2505703825351, previous cumulative loss=603.2928372386689, change rate=0.01333061883005943, #trainingExamples=1587
    Iteration #13 | average loss=0.37029474458756273, current cumulative loss=587.657759660462, previous cumulative loss=595.2505703825351, change rate=0.012755654676976761, #trainingExamples=1587
    Iteration #14 | average loss=0.36574472099268607, current cumulative loss=580.4368722153928, previous cumulative loss=587.657759660462, change rate=0.012287572700888608, #trainingExamples=1587
    Iteration #15 | average loss=0.3613904840032808, current cumulative loss=573.5266981132066, previous cumulative loss=580.4368722153928, change rate=0.011905126005885216, #trainingExamples=1587
    Performed 15 iterations of 1,587 training examples on memory (thus 23,805 training updates in total)
    ```
    > LogLoss: 0.4771035166468042
    
    #### LIBFFM
    
    ```
    $ ./ffm-train -k 4 -t 15 -l 0.00002 -r 0.2 -s 1 ../tr.sp model
    First check if the text file has already been converted to binary format (0.0 seconds)
    Binary file NOT found. Convert text file to binary file (0.0 seconds)
    iter   tr_logloss      tr_time
       1      0.62043          0.0
       2      0.47533          0.1
       3      0.44968          0.1
       4      0.43548          0.2
       5      0.42261          0.2
       6      0.41322          0.3
       7      0.40489          0.3
       8      0.39687          0.4
       9      0.39085          0.4
      10      0.38530          0.4
      11      0.37965          0.5
      12      0.37450          0.5
      13      0.36937          0.6
      14      0.36444          0.6
      15      0.36031          0.7
    $ ./ffm-predict ../va.sp model submission.csv
    logloss = 0.47818
    ```
    
    ### Without linear terms (i.e., adding `-disable_wi` option)
    
    #### Hivemall
    
    ```
    Iteration #2 | average loss=0.539961924393562, current cumulative loss=856.919574012583, previous cumulative loss=1651.6985545424677, change rate=0.48118888179934516, #trainingExamples=1587
    Iteration #3 | average loss=0.5106114115327627, current cumulative loss=810.3403101024943, previous cumulative loss=856.919574012583, change rate=0.05435663430113771, #trainingExamples=1587
    Iteration #4 | average loss=0.4906722901321148, current cumulative loss=778.6969244396662, previous cumulative loss=810.3403101024943, change rate=0.03904950212686045, #trainingExamples=1587
    Iteration #5 | average loss=0.4754916462118607, current cumulative loss=754.6052425382229, previous cumulative loss=778.6969244396662, change rate=0.030938457755922362, #trainingExamples=1587
    Iteration #6 | average loss=0.46330291728471334, current cumulative loss=735.2617297308401, previous cumulative loss=754.6052425382229, change rate=0.025633949669257704, #trainingExamples=1587
    Iteration #7 | average loss=0.453140805287918, current cumulative loss=719.1344579919258, previous cumulative loss=735.2617297308401, change rate=0.021934055706691043, #trainingExamples=1587
    Iteration #8 | average loss=0.44439540937886607, current cumulative loss=705.2555146842604, previous cumulative loss=719.1344579919258, change rate=0.019299510895946, #trainingExamples=1587
    Iteration #9 | average loss=0.4366611986545602, current cumulative loss=692.9813222647871, previous cumulative loss=705.2555146842604, change rate=0.017403894282157387, #trainingExamples=1587
    Iteration #11 | average loss=0.42321511843877446, current cumulative loss=671.6423929623351, previous cumulative loss=681.8770641514493, change rate=0.015009554840872389, #trainingExamples=1587
    Iteration #12 | average loss=0.4171781468097722, current cumulative loss=662.0617189871085, previous cumulative loss=671.6423929623351, change rate=0.01426454624606136, #trainingExamples=1587
    Iteration #13 | average loss=0.411451696404218, current cumulative loss=652.973842193494, previous cumulative loss=662.0617189871085, change rate=0.013726630815504848, #trainingExamples=1587
    Iteration #14 | average loss=0.40595767772793845, current cumulative loss=644.2548345542383, previous cumulative loss=652.973842193494, change rate=0.013352767103145282, #trainingExamples=1587
    Iteration #15 | average loss=0.4006353270154049, current cumulative loss=635.8082639734475, previous cumulative loss=644.2548345542383, change rate=0.013110604884532947, #trainingExamples=1587
    Performed 15 iterations of 1,587 training examples on memory (thus 23,805 training updates in total)
    ```
    > LogLoss: 0.4757278678816663
    
    #### LIBFFM
    
    ```
    $ ./ffm-train -k 4 -t 15 -l 0.00002 -r 0.2 -s 1 --disable-wi ../tr.sp model
    First check if the text file has already been converted to binary format (0.0 seconds)
    Binary file found. Skip converting text to binary
    iter   tr_logloss      tr_time
       1      1.03199          0.1
       2      0.53894          0.1
       3      0.51018          0.1
       4      0.49096          0.2
       5      0.47549          0.2
       6      0.46334          0.3
       7      0.45313          0.3
       8      0.44405          0.3
       9      0.43662          0.4
      10      0.42985          0.4
      11      0.42337          0.5
      12      0.41732          0.5
      13      0.41140          0.6
      14      0.40583          0.6
      15      0.40049          0.6
    $ ./ffm-predict ../va.sp model submission.csv
    logloss = 0.47284
    ```
    
    FFM w/o linear terms works slightly better in both Hivemall and LIBFFM.


---

[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...

Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/149
  
    I'll change the default options and consider implementing the early stopping option as you suggested.
    
    > What happens without `-l2norm` ?
    
    Once we drop instance-wise L2 normalization, the model easily overfits to the training samples, and prediction accuracy gets dramatically worse.
    
    **LIBFFM**:
    
    ```
    $ ./ffm-train -k 4 -t 15 -l 0.00002 -r 0.2 -s 1 --no-norm ../tr.sp model
    First check if the text file has already been converted to binary format (0.0 seconds)
    Binary file NOT found. Convert text file to binary file (0.0 seconds)
    iter   tr_logloss      tr_time
       1      4.24374          0.0
       2      0.53960          0.1
       3      0.09525          0.2
       4      0.01288          0.2
       5      0.00215          0.3
       6      0.00133          0.3
       7      0.00112          0.3
       8      0.00098          0.4
       9      0.00089          0.4
      10      0.00082          0.5
      11      0.00076          0.5
      12      0.00072          0.6
      13      0.00068          0.6
      14      0.00064          0.6
      15      0.00061          0.7
    $ ./ffm-predict ../va.sp model submission.csv
    logloss = 1.75623
    ```
    
    **Hivemall**:
    
    ```
    Iteration #2 | average loss=0.5186307939402891, current cumulative loss=823.0670699832388, previous cumulative loss=6640.3299608989755, change rate=0.876050275388452, #trainingExamples=1587
    Iteration #3 | average loss=0.06870252595245425, current cumulative loss=109.0309086865449, previous cumulative loss=823.0670699832388, change rate=0.8675309550547743, #trainingExamples=1587
    Iteration #4 | average loss=0.01701292407900819, current cumulative loss=26.999510513386, previous cumulative loss=109.0309086865449, change rate=0.7523682886014696, #trainingExamples=1587
    Iteration #5 | average loss=0.003132377872105223, current cumulative loss=4.971083683030989, previous cumulative loss=26.999510513386, change rate=0.8158824516256917, #trainingExamples=1587
    Iteration #6 | average loss=0.001693780516846469, current cumulative loss=2.6880296802353465, previous cumulative loss=4.971083683030989, change rate=0.4592668617888987, #trainingExamples=1587
    Iteration #7 | average loss=0.0013357168592237345, current cumulative loss=2.1197826555880668, previous cumulative loss=2.6880296802353465, change rate=0.21139908864307172, #trainingExamples=1587
    Iteration #8 | average loss=0.0011459013923848537, current cumulative loss=1.8185455097147627, previous cumulative loss=2.1197826555880668, change rate=0.1421075623386188, #trainingExamples=1587
    Iteration #9 | average loss=0.001017751388111345, current cumulative loss=1.6151714529327046, previous cumulative loss=1.8185455097147627, change rate=0.11183336116452601, #trainingExamples=1587
    Iteration #10 | average loss=9.230266490923267E-4, current cumulative loss=1.4648432921095225, previous cumulative loss=1.6151714529327046, change rate=0.0930725716766649, #trainingExamples=1587
    Iteration #11 | average loss=8.493080071393429E-4, current cumulative loss=1.3478518073301373, previous cumulative loss=1.4648432921095225, change rate=0.07986621190783184, #trainingExamples=1587
    Iteration #12 | average loss=7.898623710141035E-4, current cumulative loss=1.2535115827993821, previous cumulative loss=1.3478518073301373, change rate=0.0699930244687856, #trainingExamples=1587
    Iteration #13 | average loss=7.406521210973545E-4, current cumulative loss=1.1754149161815017, previous cumulative loss=1.2535115827993821, change rate=0.06230230951952787, #trainingExamples=1587
    Iteration #14 | average loss=6.990685420175246E-4, current cumulative loss=1.1094217761818115, previous cumulative loss=1.1754149161815017, change rate=0.056144548696113294, #trainingExamples=1587
    Iteration #15 | average loss=6.633493164996776E-4, current cumulative loss=1.0527353652849885, previous cumulative loss=1.1094217761818115, change rate=0.051095455410939475, #trainingExamples=1587
    Performed 15 iterations of 1,587 training examples on memory (thus 23,805 training updates in total)
    ```
    
    ```
    LogLoss: 1.8970086009757248
    ```


---

[GitHub] incubator-hivemall pull request #149: [WIP][HIVEMALL-201] Evaluate, fix and ...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/149#discussion_r191298514
  
    --- Diff: core/src/main/java/hivemall/fm/FactorizationMachineUDTF.java ---
    @@ -379,23 +379,28 @@ protected void checkInputVector(@Nonnull final Feature[] x) throws HiveException
             _model.check(x);
         }
     
    +    protected void processValidationSample(@Nonnull final Feature[] x, final double y)
    +            throws HiveException {
    +        if (_adaptiveRegularization) {
    +            trainLambda(x, y); // adaptive regularization
    --- End diff --
    
    `FFM fully ignores the adaptive regularization option` is the expected behavior.
    AdaptiveRegularization has not been tested with FFM and/or FTRL.


---

[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/149
  
    Also, it's better to revise the default `-iters` from 1 to 10 (i.e., at least 10 iterations with early stopping).


---

[GitHub] incubator-hivemall pull request #149: [HIVEMALL-201] Evaluate, fix and docum...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/149#discussion_r200590772
  
    --- Diff: core/src/main/java/hivemall/fm/FactorizationMachineUDTF.java ---
    @@ -283,9 +293,16 @@ public void process(Object[] args) throws HiveException {
             }
     
             ++_t;
    -        recordTrain(x, y);
    -        boolean adaptiveRegularization = (_va_rand != null) && _t >= _validationThreshold;
    -        train(x, y, adaptiveRegularization);
    +
    +        boolean validation = false;
    +        if ((_va_rand != null) && _t >= _validationThreshold) {
    +            final float rnd = _va_rand.nextFloat();
    +            validation = rnd < _validationRatio;
    +        }
    +
    +        recordTrain(x, y, validation);
    +
    +        train(x, y, validation);
    --- End diff --
    
    Validation examples are fixed in this implementation. Also, not using non-validation examples for regularization is a bad strategy.


---

[GitHub] incubator-hivemall pull request #149: [HIVEMALL-201] Evaluate, fix and docum...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/149#discussion_r211470802
  
    --- Diff: core/src/main/java/hivemall/fm/FieldAwareFactorizationMachineModel.java ---
    @@ -123,17 +117,18 @@ void updateWi(final double dloss, @Nonnull final Feature x, final long t) {
             }
     
             final double Xi = x.getValue();
    -        float gradWi = (float) (dloss * Xi);
     
             final Entry theta = getEntryW(x);
             float wi = theta.getW();
     
    -        final float eta = eta(theta, t, gradWi);
    -        float nextWi = wi - eta * (gradWi + 2.f * _lambdaW * wi);
    +        float grad = (float) (dloss * Xi + 2.f * _lambdaW * wi);
    --- End diff --
    
    regularization should not be performed here (?)
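    
    For clarity, here is a simplified sketch (hypothetical, single weight) of the two variants. Given the same eta they compute the same next weight; the difference is that the folded form passes the regularized gradient to `eta()`, so AdaGrad's squared-gradient accumulator also absorbs the L2 term:
    
    ```java
    // Hypothetical sketch contrasting where the L2 penalty is applied; assumed names.
    final class UpdateWiSketch {
        private final float lambdaW = 0.00002f; // assumed regularization parameter
    
        /** Original form: eta() sees only the loss gradient; the L2 penalty is added in the update step. */
        float updateSeparately(float wi, float dloss, float xi, float eta) {
            float gradWi = dloss * xi;
            return wi - eta * (gradWi + 2.f * lambdaW * wi);
        }
    
        /** Modified form: the L2 penalty is folded into the gradient before eta() and accumulation. */
        float updateFolded(float wi, float dloss, float xi, float eta) {
            float grad = dloss * xi + 2.f * lambdaW * wi;
            return wi - eta * grad;
        }
    }
    ```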


---