Posted to issues@hivemall.apache.org by myui <gi...@git.apache.org> on 2018/08/23 06:28:06 UTC

[GitHub] incubator-hivemall pull request #155: [HIVEMALL-201-2] Evaluate, fix and doc...

GitHub user myui opened a pull request:

    https://github.com/apache/incubator-hivemall/pull/155

    [HIVEMALL-201-2] Evaluate, fix and document FFM

    ## What changes were proposed in this pull request?
    
    Applied some refactoring to #149 
    This PR closes #149 
    
    ## What type of PR is it?
    
    Hot Fix, Refactoring
    
    ## What is the Jira issue?
    
    https://issues.apache.org/jira/browse/HIVEMALL-201
    
    ## How was this patch tested?
    
    unit tests, manual tests
    
    ## How to use this feature?
    
    Will be published at: http://hivemall.incubator.apache.org/userguide/binaryclass/criteo_ffm.html
    
    ## Checklist
    
    (Please remove this section if not needed; check `x` for YES, blank for NO)
    
    - [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
    - [x] Did you run system tests on Hive (or Spark)?


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/myui/incubator-hivemall HIVEMALL-201-2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-hivemall/pull/155.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #155
    
----
commit c4d6855d6286249e150e4c8dcd5413bcde339990
Author: Takuya Kitazawa <k....@...>
Date:   2018-05-16T08:39:32Z

    Use pre-defined constants in option description

commit f7e7e1d49e5fa2e4f4f50d55f85c5cdee3bb69b1
Author: Takuya Kitazawa <k....@...>
Date:   2018-05-16T08:40:48Z

    Fix mismatch between opts.addOption and cl.getOptionValue

commit 929781a982f86851e38d558bb79a239d90c90e76
Author: Takuya Kitazawa <k....@...>
Date:   2018-05-16T08:41:34Z

    Support FFM feature format in `l1_normalize` and `l2_normalize`

commit a1751361f8ae2204cdc6507514945ebaa1ddf179
Author: Takuya Kitazawa <k....@...>
Date:   2018-05-21T06:02:14Z

    Increase `alphaFTRL` in `testSampleEnableNorm` for convergence

commit ff049d776133d1bc0cf7e62d9740f22a3943f593
Author: Takuya Kitazawa <k....@...>
Date:   2018-05-22T02:16:51Z

    Fix typo

commit 35a02451fc4e8a55bbb49b7fede3c545145b7d6e
Author: Takuya Kitazawa <k....@...>
Date:   2018-05-22T05:22:35Z

    Fix bug in forward model
    
    Due to a typo, linear weights in the model were not correctly forwarded.

commit 9782136e3059df1d334c814c9eb9455e1ec9b573
Author: Takuya Kitazawa <k....@...>
Date:   2018-05-22T06:39:22Z

    Fix order of computing AdaGrad learning rate
    
    * The gradient includes the regularization term
    * Take the sum of squared gradients after adding the latest gradient
    
    See:
    https://github.com/guestwalk/libffm/blob/7db5b4f1ad3af7eb5bd0c224b2fa5305e1a715d2/ffm.cpp#L219-L226
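The corrected update order can be sketched as follows. This is a minimal, hypothetical Python illustration (the actual implementation is Hivemall's Java code; the `1.0` in the denominator stands in for LIBFFM's initialization of the accumulator to 1):

```python
import math

def adagrad_update(w, sum_sq_grad, grad, lambda2, eta0):
    # The gradient includes the L2 regularization term
    g = grad + lambda2 * w
    # Accumulate the squared gradient FIRST ...
    sum_sq_grad += g * g
    # ... and only then derive the per-parameter learning rate
    eta = eta0 / math.sqrt(1.0 + sum_sq_grad)
    return w - eta * g, sum_sq_grad
```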

commit 2366d910581248249a4e69e1110675469a17ea99
Author: Takuya Kitazawa <k....@...>
Date:   2018-05-22T06:47:03Z

    Enable specifying the initial learning rate for AdaGrad

commit f1fd20cd508a8473bd0fef037cd708d5c3379c5f
Author: Takuya Kitazawa <k....@...>
Date:   2018-05-22T08:35:36Z

    Make `-max_init_value` more meaningful
    
    In fact, the code sampled random values from [0, max_init_value / k],
    but users expect each element of V to be initialized with a random
    value in [0, max_init_value].
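Conceptually, the fix changes the sampling range; a hypothetical Python sketch (function and parameter names are illustrative, not Hivemall's source):

```python
import random

def init_factors(k, max_init_value, seed=42):
    """Draw each element of a length-k factor vector uniformly from
    [0, max_init_value], as users expect (the old behavior effectively
    drew from [0, max_init_value / k])."""
    rnd = random.Random(seed)
    return [rnd.uniform(0.0, max_init_value) for _ in range(k)]
```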

commit 478f26dab385b3835cdfbe19d40beef47336d92d
Author: Takuya Kitazawa <k....@...>
Date:   2018-05-23T05:19:17Z

    Add `-l2norm` option to FeaturePairsUDTF
    
    Users can configure whether the feature vector is L2-normalized, in
    the same way as `train_ffm`.

commit 3627ca84e857210aa921fd607fed19759d26fba0
Author: Takuya Kitazawa <k....@...>
Date:   2018-05-23T06:27:02Z

    Switch `-disable_wi` option to `-enable_wi`

commit e2c378f5134c67d25047169324c6aa9df62e8b8f
Author: Takuya Kitazawa <k....@...>
Date:   2018-05-23T07:01:09Z

    Fix test broken by change of default learn rate for FFM+AdaGrad

commit 056dfde30437c9bbcfca4444f292698ba97dfa67
Author: Takuya Kitazawa <k....@...>
Date:   2018-05-23T07:27:34Z

    FFM applies instance-wise L2 normalization by default
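Instance-wise L2 normalization rescales each feature vector to unit norm before training; a minimal sketch (illustrative Python, not the Hivemall source):

```python
import math

def l2_normalize(values):
    # Rescale so the vector's L2 norm becomes 1; leave a zero vector untouched
    norm = math.sqrt(sum(v * v for v in values))
    return list(values) if norm == 0.0 else [v / norm for v in values]
```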

commit 91aed6ecdc5401d972eac534e54246c59fd15ebb
Author: Takuya Kitazawa <k....@...>
Date:   2018-05-24T00:48:37Z

    Increase default number of iterations to rely more on cv_test

commit dca7e5762d664039354d00da8c3ca9adccd5d7c2
Author: Takuya Kitazawa <k....@...>
Date:   2018-05-24T04:23:24Z

    Make default L2 regularization parameter smaller
    
    The new default value 0.0001 is the same as for FTRL and the general
    regressor/classifier.
    
    0.01 was too large on small data; in some cases a model could not be
    learnt successfully. By contrast, LIBFFM uses the very small value
    0.00002 by default. This commit sets 0.0001, in between these values,
    as a compromise.

commit f84c960285f04ada21fb346e94ed0b5683d31289
Author: Takuya Kitazawa <k....@...>
Date:   2018-05-24T04:49:27Z

    Increase default learn rate from 0.05 to 0.1
    
    Referred to the following implementations:
    
    LIBFFM: 0.2 (with AdaGrad)
    https://github.com/guestwalk/libffm/blob/740103e5eb920a4061dd8e977a2ede6d23c6910a/ffm.h#L31
    
    libFM: 0.1
    https://github.com/srendle/libfm/blob/4ba0e0d5646da5d00701d853d19fbbe9b236cfd7/src/libfm/libfm.cpp#L87

commit 5b9d36746d1bf432098a7a8ad02be3f5db1bef3e
Author: Takuya Kitazawa <k....@...>
Date:   2018-05-24T05:06:22Z

    Update FFM unit test cases
    
    * Remove the `runIterations` method and use `run` with an appropriate `-iters` option
    * Follow up the previous change of default options
    * Drop some options and confirm that their default values work reasonably

commit 3a11ca096f1bd5287ef857f0781fa61a5e6efa4d
Author: Takuya Kitazawa <k....@...>
Date:   2018-05-24T08:34:39Z

    FFM UDTF does not override train() method in FM UDF
    
    The only difference between them is the type of the model instance:
    FFM checks `_ffmModel`, while FM refers to `_model`.
    
    Note that `adaptiveRegularization` is always false in FFM.

commit a48e8017339ba8b284fdeeb6bf21ee6ed2159983
Author: Takuya Kitazawa <k....@...>
Date:   2018-05-24T08:44:17Z

    Use consistent set of validation samples over iterations
    
    Store whether a sample is used for validation so that later
    iterations reuse the same split.
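The idea can be sketched as deciding the split once up front and reusing it (hypothetical Python; names are illustrative):

```python
import random

def assign_validation_flags(n_samples, validation_ratio, seed=31):
    # Flag each sample once; every iteration then consults the same
    # flags, so the validation set does not change between iterations.
    rnd = random.Random(seed)
    return [rnd.random() < validation_ratio for _ in range(n_samples)]
```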

commit 38875b91287a821db5a8f3ea3c307576378ce485
Author: Takuya Kitazawa <k....@...>
Date:   2018-05-25T06:05:04Z

    Support `-early_stopping` option in FM/FFM by using validation samples
    
    This implementation is still incomplete: if the validation loss
    increases at the n-th iteration, we should forward the previous model
    obtained at the (n-1)-th iteration.

commit 3ca451c6bb7e1a74a0a309b1c3e4892f1717ba40
Author: Takuya Kitazawa <k....@...>
Date:   2018-05-25T06:27:33Z

    Fix typo: validatiState -> validationState

commit eb943d935d7b91054016de83273db9f86688d853
Author: Takuya Kitazawa <k....@...>
Date:   2018-05-28T07:25:32Z

    Enable setting W and V by directly pointing to a feature index

commit 2875fe98b72bda17946ca0692aef3b8c4f9af86c
Author: Takuya Kitazawa <k....@...>
Date:   2018-05-28T07:33:37Z

    Enable caching/restoring FFM model parameters for early stopping

commit b670698c4d4e8188baeaffb13aac38ff55da0a03
Author: Takuya Kitazawa <k....@...>
Date:   2018-05-28T08:15:25Z

    Update early stopping option test case
    
    This version of the test cases checks that:
    - early stopping works as expected
    - early stopping holds the correct "best" model parameters

commit 714298f608396bb3f817059827df9e27bc34591a
Author: Takuya Kitazawa <k....@...>
Date:   2018-05-28T08:24:44Z

    Fix missing `cacheCurrentModel` call

commit 700b40cb3829996a97b2caf831776e4fbaffdf51
Author: Takuya Kitazawa <k....@...>
Date:   2018-05-28T08:29:51Z

    Format code

commit cc7e1010ed91930e067cfa15d6726f455bcece8e
Author: Takuya Kitazawa <k....@...>
Date:   2018-05-28T08:42:07Z

    FFM fully ignores adaptive regularization option

commit 42f9b97352978f07ae0479c7a75862d490f937fc
Author: Takuya Kitazawa <k....@...>
Date:   2018-05-29T05:22:05Z

    Stop caching the previous best model parameters
    
    Caching previous model parameters consumes 2x memory. To avoid
    consuming that much memory, the `-early_stopping` option forwards the
    model obtained at the (N+1)-th iteration as a compromise when training
    is stopped early at the N-th iteration.

commit 9f6a761f4abe017a2fa17590e1f5e2d40fe6fcda
Author: Takuya Kitazawa <k....@...>
Date:   2018-05-29T05:28:45Z

    Make `_validationState` non-null for simplicity

commit b1fc49b8a86295d7c3b0fed284e118128450180c
Author: Takuya Kitazawa <k....@...>
Date:   2018-05-29T05:53:43Z

    Stop iterating only if the loss increases over 2 consecutive
    iterations
    
    "Immediately stop training once the loss increases" might be too
    aggressive.
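The rule can be sketched as a small patience check (illustrative Python; the actual logic lives in Hivemall's Java validation state):

```python
def should_stop(val_losses, patience=2):
    # Stop only after `patience` consecutive increases of the
    # validation loss, not on the first uptick.
    if len(val_losses) <= patience:
        return False
    return all(val_losses[-i] > val_losses[-i - 1]
               for i in range(1, patience + 1))
```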

----


---

[GitHub] incubator-hivemall pull request #155: [HIVEMALL-201] Evaluate, fix and docum...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-hivemall/pull/155


---

[GitHub] incubator-hivemall issue #155: [HIVEMALL-201-2] Evaluate, fix and document F...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/155
  
    @takuti will merge after EMR tests. FYI


---