Posted to issues@hivemall.apache.org by takuti <gi...@git.apache.org> on 2018/05/17 06:13:15 UTC

[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...

Github user takuti commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/149
  
    Evaluation has been conducted at [takuti/criteo-ffm](https://github.com/takuti/criteo-ffm); see the repository for details.
    
    As an example, I have used the tiny dataset provided at [guestwalk/kaggle-2014-criteo](https://github.com/guestwalk/kaggle-2014-criteo), which is already preprocessed and converted into the LIBFFM format:
    
    - Split the 2,000 samples in `train.tiny.csv` into the following two sets (one possible way to produce such a split is sketched after this list):
      - 1,587 training samples `tr.sp`
      - 412 validation samples `va.sp`
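
    For reference, here is a minimal sketch of one way such a split could be produced. The input file name `train.tiny.sp`, the shuffling step, and the exact line counts are assumptions for illustration; the actual procedure in takuti/criteo-ffm may differ:

    ```
    $ # hypothetical ~80/20 split: shuffle first, then cut by line count
    $ shuf train.tiny.sp > shuffled.sp
    $ head -n 1587 shuffled.sp > tr.sp   # training samples
    $ tail -n +1588 shuffled.sp > va.sp  # validation samples
    ```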
    
    As a result, the FFM models created by LIBFFM and Hivemall under the (nearly equivalent) configurations below showed very similar training loss and accuracy. Note that the options roughly correspond to each other: LIBFFM's `-k 4`, `-t 15`, `-l 0.00002`, and `-r 0.2` map to Hivemall's `-factors 4`, `-iterations 15`, `-lambda 0.00002`, and `-eta 0.2`, respectively.
    
    **LIBFFM**:
    
    ```
    $ ./ffm-train -k 4 -t 15 -l 0.00002 -r 0.2 -s 10 ../tr.sp model
    iter   tr_logloss      tr_time
       1      1.04980          0.0
       2      0.53771          0.0
       3      0.50963          0.0
       4      0.48980          0.1
       5      0.47469          0.1
       6      0.46304          0.1
       7      0.45289          0.1
       8      0.44400          0.1
       9      0.43653          0.1
      10      0.42947          0.1
      11      0.42330          0.1
      12      0.41727          0.1
      13      0.41130          0.1
      14      0.40558          0.1
      15      0.40036          0.1
    ```
    
    > LogLoss on the validation set `va.sp`: 0.47237
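
    For reference, that number was presumably obtained with LIBFFM's `ffm-predict` tool, which reports the logloss on a given test file (the relative paths here are assumptions):

    ```
    $ # evaluate the trained model on the held-out set; ffm-predict prints the logloss
    $ ./ffm-predict ../va.sp model output
    ```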
    
    **Hivemall**:
    
    ```
    $ hive --hiveconf hive.root.logger=INFO,console
    hive> INSERT OVERWRITE TABLE criteo.ffm_model
        > SELECT
        >   train_ffm(features, label, '-init_v random -max_init_value 1.0 -classification -iterations 15 -factors 4 -eta 0.2 -l2norm -optimizer sgd -lambda 0.00002 -cv_rate 0.0 -disable_wi')
        > FROM (
        >   SELECT
        >     features, label
        >   FROM
        >     criteo.train_vectorized
        >   CLUSTER BY rand(1)
        > ) t
        > ;
    Record training examples to a file: /var/folders/rg/6mhvj7h567x_ys7brmf2bb6w0000gn/T/hivemall_fm6211397472147242886.sgmt
    Iteration #2 | average loss=0.5316043797079182, current cumulative loss=843.6561505964662, previous cumulative loss=1214.5909560888044, change rate=0.30539895232450376, #trainingExamples=1587
    Iteration #3 | average loss=0.5065999656968238, current cumulative loss=803.9741455608594, previous cumulative loss=843.6561505964662, change rate=0.04703575622313853, #trainingExamples=1587
    Iteration #4 | average loss=0.49634490612175397, current cumulative loss=787.6993660152235, previous cumulative loss=803.9741455608594, change rate=0.0202429140731664, #trainingExamples=1587
    Iteration #5 | average loss=0.48804954980765963, current cumulative loss=774.5346355447558, previous cumulative loss=787.6993660152235, change rate=0.0167128869698916, #trainingExamples=1587
    Iteration #6 | average loss=0.48072518575956447, current cumulative loss=762.9108698004288, previous cumulative loss=774.5346355447558, change rate=0.015007418920848658, #trainingExamples=1587
    Iteration #7 | average loss=0.47402279755334875, current cumulative loss=752.2741797171644, previous cumulative loss=762.9108698004288, change rate=0.013942244768444403, #trainingExamples=1587
    Iteration #8 | average loss=0.4677507471836629, current cumulative loss=742.320435780473, previous cumulative loss=752.2741797171644, change rate=0.013231537390308698, #trainingExamples=1587
    Iteration #9 | average loss=0.4618142861358177, current cumulative loss=732.8992720975427, previous cumulative loss=742.320435780473, change rate=0.012691505216375798, #trainingExamples=1587
    Iteration #10 | average loss=0.4561878517855827, current cumulative loss=723.9701207837197, previous cumulative loss=732.8992720975427, change rate=0.012183326759580433, #trainingExamples=1587
    Iteration #11 | average loss=0.45087834343992406, current cumulative loss=715.5439310391595, previous cumulative loss=723.9701207837197, change rate=0.01163886395675921, #trainingExamples=1587
    Iteration #12 | average loss=0.4458864402438874, current cumulative loss=707.6217806670493, previous cumulative loss=715.5439310391595, change rate=0.011071508021324606, #trainingExamples=1587
    Iteration #13 | average loss=0.44118468270053807, current cumulative loss=700.1600914457539, previous cumulative loss=707.6217806670493, change rate=0.010544742156271002, #trainingExamples=1587
    Iteration #14 | average loss=0.4367191822212713, current cumulative loss=693.0733421851576, previous cumulative loss=700.1600914457539, change rate=0.01012161268141256, #trainingExamples=1587
    Iteration #15 | average loss=0.4324248854220929, current cumulative loss=686.2582931648615, previous cumulative loss=693.0733421851576, change rate=0.009833084906727563, #trainingExamples=1587
    Performed 15 iterations of 1,587 training examples on memory (thus 23,805 training updates in total)
    ```
    
    > LogLoss on the same validation set: 0.47604112308042346
    
    Note that, since we used the `-l2norm` option for training, the validation samples should also be L2-normalized, as in: `feature_pairs(l2_normalize(t1.features), '-ffm')`.
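
    As a rough sketch, the validation logloss on the Hivemall side can then be computed with the `logloss` evaluation UDAF. The table and column names (`criteo.va_predicted`, `probability`, `label`) are assumptions, and the actual prediction query in takuti/criteo-ffm may differ:

    ```
    hive> -- assumes criteo.va_predicted(rowid, probability, label) holds the model's
    hive> -- sigmoid outputs for each L2-normalized validation sample and its true label
    hive> SELECT logloss(probability, label)
        > FROM criteo.va_predicted;
    ```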
    
    While the choice of hyperparameters and optimizer (SGD/FTRL/AdaGrad) affects the accuracy to some degree, I have noticed that `-disable_wi` can be a more important factor on this data: if we use the linear terms to train the FFM model, LogLoss on `va.sp` increases significantly, to `1.5227099483928919`.
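
    For reference, enabling the linear terms only requires dropping `-disable_wi` from the option string; everything else in the training query stays as above:

    ```
    hive> -- same training query as above, with -disable_wi removed to enable the linear terms
    hive> INSERT OVERWRITE TABLE criteo.ffm_model
        > SELECT
        >   train_ffm(features, label, '-init_v random -max_init_value 1.0 -classification -iterations 15 -factors 4 -eta 0.2 -l2norm -optimizer sgd -lambda 0.00002 -cv_rate 0.0')
        > FROM (
        >   SELECT features, label FROM criteo.train_vectorized CLUSTER BY rand(1)
        > ) t;
    ```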
    
    I'm still not sure whether this result is natural or caused by a bug. Let me double-check the implementation.

