Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/12/09 03:17:30 UTC

[GitHub] [spark] zhengruifeng edited a comment on issue #26735: [SPARK-30102][WIP][ML][PYSPARK] GMM supports instance weighting

URL: https://github.com/apache/spark/pull/26735#issuecomment-563044225
 
 
   It took me a few days to look into this test failure.
   
   1, In 2.4.4, I cannot reproduce the doctest results:
   ```python
       >>> summary.clusterSizes
       [2, 2, 2]
       >>> summary.logLikelihood
       8.14636...
   ```
   until I explicitly set the number of partitions to 2, like this: `df = spark.createDataFrame(sc.parallelize(data, 2), ["features"])`.
   That is because the existing `df = spark.createDataFrame(data, ["features"])` creates a df with 12 partitions, and GMM is highly sensitive to the initialization.
   It is also odd to me that `spark.createDataFrame` creates a df with 6 partitions on the Scala side.
   My laptop has an 8850 CPU with 6 cores and 12 threads.
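
   To make the partitioning point concrete, here is a rough standalone sketch (not the doctest itself); the `data` list and the `local[12]` master are only stand-ins to mimic the doctest data and the 12-thread laptop:
   ```python
   from pyspark.sql import SparkSession
   from pyspark.ml.linalg import Vectors

   spark = SparkSession.builder.master("local[12]").getOrCreate()
   sc = spark.sparkContext

   # stand-in for the doctest's `data`: a few 2-D points in the same shape
   data = [(Vectors.dense([-0.1, -0.05]),), (Vectors.dense([-0.01, -0.1]),),
           (Vectors.dense([0.9, 0.8]),), (Vectors.dense([0.75, 0.935]),),
           (Vectors.dense([-0.83, -0.68]),), (Vectors.dense([-0.91, -0.76]),)]

   # default path: the partition count follows the default parallelism
   # (12 on this setup), so the GMM initialization becomes machine-dependent
   df_default = spark.createDataFrame(data, ["features"])
   print(df_default.rdd.getNumPartitions())   # 12 with local[12]

   # explicit path: parallelize first so the partition count is pinned to 2
   df_fixed = spark.createDataFrame(sc.parallelize(data, 2), ["features"])
   print(df_fixed.rdd.getNumPartitions())     # 2
   ```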
   
   2, After using `df = spark.createDataFrame(sc.parallelize(data, 2), ["features"])`, I can reproduce the 2.4.4 results. However, the doctests still fail. I logged the optimization metric `logLikelihood` after each iteration and found what looks like a sudden numeric change:
   
   | Iteration | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
   |-----------|---|---|---|---|---|---|---|---|---|---|
   | Master | -13.306466494963615 | -0.4307654468425961 | 0.49157579336057605 | 2.234212048899172 | 6.125367537295512 | 11.27762326533469 | 35.232285502171976 | 10.028821186214191 | 23.693392686726106 | 8.146360246481793 |
   | This PR | -13.306466494963615 | -0.430765446842597 | 0.4915757933605755 | 2.234212048899182 | 6.125367537295558 | 11.277623265335476 | 35.229680601767065 | 46.33491773124833 | 57.694248782061024 | 26.193922336279954 |
   
   The metrics are close before iteration 7, but a sudden numeric change happens at iteration 7. I think that by itself is acceptable, since the internal computation is complex.
   Moreover, the current convergence check `math.abs(logLikelihood - logLikelihoodPrev) > $(tol)` does not work well when the optimization objective takes a big hit, such as `logLikelihood` dropping from 35.232285502171976 to 10.028821186214191 at iteration 7.
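
   To spell out the convergence issue, here is a small Python sketch (an illustration only, not Spark's actual Scala code, and `step` is a hypothetical callable standing in for one EM iteration): the current-style stop rule looks only at the magnitude of the change, so it neither notices nor reacts when the objective drops sharply; remembering the best value seen so far is one simple way to avoid ending up with a degraded result.
   ```python
   def em_loop(step, ll_init, max_iter, tol):
       """Toy EM driver; step() returns the next logLikelihood."""
       ll_prev, ll = None, ll_init
       best_ll, best_iter = ll_init, 0
       for i in range(1, max_iter + 1):
           ll_prev, ll = ll, step()
           if ll > best_ll:                  # remember the best state instead of
               best_ll, best_iter = ll, i    # blindly trusting the latest iteration
           if abs(ll - ll_prev) <= tol:      # current-style check: magnitude only,
               break                         # improvement vs. degradation ignored
       return ll, best_ll, best_iter
   ```
   Whatever the final shape of the fix, the point is just that a large regression in `logLikelihood` should not be handled the same way as ordinary non-convergence.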
   
   So I think I need to:
   1, change the df generation logic to explicitly set the number of partitions (the current `createDataFrame` does not support this input, so I need to create an RDD first);
   2, change the expected result in the doctest (I tend to set `MaxIter=5` and result=11.27; see the sketch after this list);
   3, change the convergence check to avoid big drops in the optimization metric (maybe in another PR, and check other algorithms there as well).
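
   For points 1 and 2, the updated doctest could look roughly like the sketch below; `spark`, `sc`, and `data` are the doctest's existing globals, the estimator arguments (`k=3`, `seed=10`) are placeholders that may not match the real doctest exactly, and the expected `11.27...` just echoes point 2 above.
   ```python
   >>> from pyspark.ml.clustering import GaussianMixture
   >>> df = spark.createDataFrame(sc.parallelize(data, 2), ["features"])  # point 1: pin the partitioning
   >>> gm = GaussianMixture(k=3, maxIter=5, seed=10)  # point 2: stop before the unstable iterations
   >>> model = gm.fit(df)
   >>> summary = model.summary
   >>> summary.logLikelihood
   11.27...
   ```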
   
   @srowen @huaxingao What do you think?
   
