Posted to issues@spark.apache.org by "Weichen Xu (JIRA)" <ji...@apache.org> on 2016/07/24 14:02:20 UTC

[jira] [Updated] (SPARK-16697) redundant RDD computation in LDAOptimizer

     [ https://issues.apache.org/jira/browse/SPARK-16697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weichen Xu updated SPARK-16697:
-------------------------------
    Description: 
In the submitMiniBatch method of mllib.clustering.LDAOptimizer,

the stats RDD is not persisted, but the code that follows uses it twice,
so it is computed redundantly.

There is another problem:
the expElogbetaBc broadcast variable is unpersisted too early.
The next statement
`
val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat(
       stats.map(_._2).flatMap(list => list).collect().map(_.toDenseMatrix): _*)
`
recomputes the stats RDD, which uses the expElogbetaBc broadcast variable again,
so expElogbetaBc is broadcast a second time.



  was:
In the submitMiniBatch method of mllib.clustering.LDAOptimizer,

the stats RDD is not persisted, but the code that follows uses it twice,
so it is computed redundantly.


> redundant RDD computation in LDAOptimizer
> -----------------------------------------
>
>                 Key: SPARK-16697
>                 URL: https://issues.apache.org/jira/browse/SPARK-16697
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 2.0.1, 2.1.0
>            Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the submitMiniBatch method of mllib.clustering.LDAOptimizer,
> the stats RDD is not persisted, but the code that follows uses it twice,
> so it is computed redundantly.
> There is another problem:
> the expElogbetaBc broadcast variable is unpersisted too early.
> The next statement
> `
> val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat(
>        stats.map(_._2).flatMap(list => list).collect().map(_.toDenseMatrix): _*)
> `
> recomputes the stats RDD, which uses the expElogbetaBc broadcast variable again,
> so expElogbetaBc is broadcast a second time.
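
For illustration only, here is a minimal sketch of the pattern the description points at: persist an RDD that is consumed by two actions, and unpersist the broadcast variable only after the last action that needs it. This is not the LDAOptimizer code and not necessarily the fix that was applied. The names PersistSketch, run, weightsBc and evaluations are hypothetical stand-ins (weightsBc plays the role of expElogbetaBc), and the sketch assumes Spark 2.x or later on the classpath, running in local mode.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical stand-alone sketch; none of these names come from LDAOptimizer.
object PersistSketch {

  // Returns both sums plus the number of times the map function ran.
  def run(sc: SparkContext): (Double, Double, Long) = {
    // Stand-in for expElogbetaBc: a value shipped to the executors.
    val weightsBc = sc.broadcast(Array(1.0, 2.0))

    // Counts evaluations of the map function so that recomputation is visible.
    val evaluations = sc.longAccumulator("evaluations")

    // Stand-in for `stats`: each record yields a pair, and both halves are
    // consumed by separate actions below. persist() asks Spark to keep the
    // computed partitions so the second action can reuse them.
    val stats = sc.parallelize(1 to 100, 4).map { i =>
      evaluations.add(1)
      val w = weightsBc.value
      (i * w(0), i * w(1))
    }.persist(StorageLevel.MEMORY_AND_DISK)

    val firstSum = stats.map(_._1).sum()  // first action: computes the partitions
    val secondSum = stats.map(_._2).sum() // second action: can read the cached partitions

    // Release the broadcast and the cached RDD only after the last action
    // that depends on them.
    weightsBc.unpersist()
    stats.unpersist()

    (firstSum, secondSum, evaluations.value)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("persist-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (a, b, n) = run(sc)
      println(s"firstSum=$a secondSum=$b mapEvaluations=$n")
    } finally {
      sc.stop()
    }
  }
}
```

If the persist call is removed, the second action would be expected to run the map function again over every record, and if weightsBc.unpersist() were moved above the second action, the executors would have to fetch the broadcast value again. Accumulator updates made inside transformations can be repeated when tasks are retried, so the evaluation count is a rough indicator rather than a guarantee.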


