Posted to commits@spark.apache.org by yl...@apache.org on 2016/08/12 17:06:29 UTC
spark git commit: [SPARK-17033][ML][MLLIB] GaussianMixture should use
treeAggregate to improve performance
Repository: spark
Updated Branches:
refs/heads/master 79e2caa13 -> bbae20ade
[SPARK-17033][ML][MLLIB] GaussianMixture should use treeAggregate to improve performance
## What changes were proposed in this pull request?
```GaussianMixture``` should use ```treeAggregate``` rather than ```aggregate``` to improve performance and scalability. In my test on a dataset with 200 features and 1M instances, I saw a 20% performance improvement.
BTW, we should also destroy the broadcast variable ```compute``` at the end of each iteration.
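To illustrate why the switch can help, here is a minimal plain-Scala sketch of the two merge strategies. It does not use Spark and is not Spark's implementation: the `TreeAggregateSketch` object, the `Partitions` alias, and the `fanIn` parameter are all hypothetical names invented for this illustration (Spark's own ```treeAggregate``` takes a `depth` argument instead). The point is only that a flat merge combines every partial result in one place, while a tree merge combines them in small groups over several levels, so no single node has to receive all the partial sums at once.

```scala
object TreeAggregateSketch {
  // Hypothetical stand-in for an RDD: each inner Seq plays the role of one partition.
  type Partitions[A] = Seq[Seq[A]]

  // Fold every partition locally, the way seqOp runs on each executor.
  private def partials[A, U](parts: Partitions[A], zero: () => U)(
      seqOp: (U, A) => U): Seq[U] =
    parts.map(_.foldLeft(zero())(seqOp))

  // aggregate-style merge: all partial results are combined in a single place.
  def flatAggregate[A, U](parts: Partitions[A], zero: () => U)(
      seqOp: (U, A) => U, combOp: (U, U) => U): U =
    partials(parts, zero)(seqOp).foldLeft(zero())(combOp)

  // treeAggregate-style merge: partial results are combined in groups of
  // `fanIn` (an assumed knob for this sketch), level by level, until one remains.
  def treeAggregate[A, U](parts: Partitions[A], zero: () => U, fanIn: Int = 2)(
      seqOp: (U, A) => U, combOp: (U, U) => U): U = {
    require(fanIn >= 2, "fanIn must be at least 2")
    var level = partials(parts, zero)(seqOp)
    while (level.size > 1) {
      level = level.grouped(fanIn).map(_.reduce(combOp)).toVector
    }
    level.headOption.getOrElse(zero())
  }
}
```

Both functions should return the same value for an associative `combOp`, for example `TreeAggregateSketch.treeAggregate(parts, () => 0.0)(_ + _, _ + _)` versus the `flatAggregate` equivalent; only the shape of the merge differs.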
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <yb...@gmail.com>
Closes #14621 from yanboliang/spark-17033.
Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bbae20ad
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bbae20ad
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bbae20ad
Branch: refs/heads/master
Commit: bbae20ade14e50541e4403ca7b45bf6c11695d15
Parents: 79e2caa
Author: Yanbo Liang <yb...@gmail.com>
Authored: Fri Aug 12 10:06:17 2016 -0700
Committer: Yanbo Liang <yb...@gmail.com>
Committed: Fri Aug 12 10:06:17 2016 -0700
----------------------------------------------------------------------
.../scala/org/apache/spark/mllib/clustering/GaussianMixture.scala | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/spark/blob/bbae20ad/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala b/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
index a214b1a..43193ad 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
@@ -198,7 +198,7 @@ class GaussianMixture private (
val compute = sc.broadcast(ExpectationSum.add(weights, gaussians)_)
// aggregate the cluster contribution for all sample points
- val sums = breezeData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
+ val sums = breezeData.treeAggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
// Create new distributions based on the partial assignments
// (often referred to as the "M" step in literature)
@@ -227,6 +227,7 @@ class GaussianMixture private (
llhp = llh // current becomes previous
llh = sums.logLikelihood // this is the freshly computed log-likelihood
iter += 1
+ compute.destroy(blocking = false)
}
new GaussianMixtureModel(weights, gaussians)