Posted to user@spark.apache.org by Yanbo Liang <yb...@gmail.com> on 2016/01/02 09:45:11 UTC

Re: Problem embedding GaussianMixtureModel in a closure

Hi Tomasz,

The GMM is bound to its peer Java GMM object, so it needs a reference to
SparkContext.
Some MLlib (not ML) models are simple objects, such as KMeansModel and
LinearRegressionModel, but others refer to SparkContext. The latter, and
their member functions, should not be called inside map().
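
For models of the latter kind, one possible way to keep the computation
inside map() is to copy the fitted parameters into plain Python values on
the driver and evaluate the mixture by hand, so the closure never touches
the model object. What follows is only a sketch under assumptions:
gaussian_logpdf and predict_soft are hypothetical helper names, not part of
MLlib, and their output may differ slightly from
GaussianMixtureModel.predictSoft, which applies its own numerical safeguards.

```python
import math


def gaussian_logpdf(x, mu, sigma):
    """Log density of a multivariate normal.

    x, mu: lists of floats; sigma: full covariance as a list of rows.
    Assumes sigma is symmetric and positive definite.
    """
    n = len(x)
    # Cholesky factor: lower-triangular L with sigma = L * L^T.
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sigma[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    # Solve L z = (x - mu) by forward substitution; z.z is the squared
    # Mahalanobis distance.
    d = [xi - mi for xi, mi in zip(x, mu)]
    z = []
    for i in range(n):
        z.append((d[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + sum(v * v for v in z))


def predict_soft(x, weights, params):
    """Membership probabilities of x for each mixture component.

    weights: strictly positive mixing weights; params: list of (mu, sigma).
    """
    logs = [math.log(w) + gaussian_logpdf(x, mu, sigma)
            for w, (mu, sigma) in zip(weights, params)]
    m = max(logs)  # subtract the maximum before exponentiating, for stability
    exps = [math.exp(v - m) for v in logs]
    total = sum(exps)
    return [e / total for e in exps]


# Assumed usage on the driver (gmm is a trained GaussianMixtureModel):
#   weights = [float(w) for w in gmm.weights]
#   params = [(g.mu.toArray().tolist(), g.sigma.toArray().tolist())
#             for g in gmm.gaussians]
#   predictions = rdd.map(
#       lambda p: predict_soft(p.features.toArray().tolist(), weights, params))
```

Only weights and params, which are plain lists of floats, end up in the
closure, so nothing that holds a SparkContext is shipped to the workers.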

Cheers
Yanbo



2016-01-01 4:12 GMT+08:00 Tomasz Fruboes <To...@ncbj.gov.pl>:

> Dear All,
>
>  I'm trying to implement a procedure that iteratively updates an RDD using
> results from GaussianMixtureModel.predictSoft. To avoid problems with the
> local variable (the obtained GMM) being overwritten in each pass of the
> loop, I'm doing the following:
>
> #######################################################
> for i in xrange(10):
>     gmm = GaussianMixture.train(rdd, 2)
>
>     def getSafePredictor(unsafeGMM):
>         return lambda x: \
>             (unsafeGMM.predictSoft(x.features), unsafeGMM.gaussians.mu)
>
>     safePredictor = getSafePredictor(gmm)
>     predictionsRDD = (labelledpointrddselectedfeatsNansPatched
>           .map(safePredictor)
>     )
>     print predictionsRDD.take(1)
>     (... - rest of code - update rdd with results from predictionsRdd)
> #######################################################
>
> Unfortunately this ends with:
>
> #######################################################
> Exception: It appears that you are attempting to reference SparkContext
> from a broadcast variable, action, or transformation. SparkContext can only
> be used on the driver, not in code that it run on workers. For more
> information, see SPARK-5063.
> #######################################################
>
> Any idea why I'm getting this behaviour? I expected the GMM to be a
> "simple" object with no SparkContext in it. I'm using Spark 1.5.2.
>
>  Thanks,
>    Tomasz
>
>
> PS: As a workaround I'm currently doing
>
> ########################
>     def getSafeGMM(unsafeGMM):
>         return lambda x: unsafeGMM.predictSoft(x)
>
>     safeGMM = getSafeGMM(gmm)
>     predictionsRDD = \
>         safeGMM(labelledpointrddselectedfeatsNansPatched.map(rdd))
> ########################
>  which works fine. If possible, I would like to avoid this approach, since
> it would require building another closure over gmm.gaussians later in my
> code.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>

Re: Problem embedding GaussianMixtureModel in a closure

Posted by Yanbo Liang <yb...@gmail.com>.
Hi Tomasz,

This limitation will not change, and you will find that all the models in
the new Spark ML package reference SparkContext. It keeps the Python API
simple to implement.

But that does not mean you can only call this function on local data. You
can apply it to a whole RDD, as in the following snippet:

gmmModel.predictSoft(rdd)

You will then get a new RDD containing the soft predictions. All the models
in the ML package follow this rule.
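
To keep each prediction next to the record it belongs to, and to carry the
component means along, the driver-side call can be combined with zip().
This is a minimal sketch, assuming the input is an RDD of records with a
.features attribute (such as LabeledPoint); soft_predictions_with_means is a
hypothetical helper name, not an MLlib function.

```python
def soft_predictions_with_means(points, gmm):
    """Pair every record with its membership vector and the component means.

    points: RDD of records exposing a .features vector (e.g. LabeledPoint)
    gmm:    trained GaussianMixtureModel, used on the driver only
    """
    features = points.map(lambda p: p.features)
    # Driver-side call: the model receives a whole RDD instead of being
    # captured by a function that is shipped to the workers.
    memberships = gmm.predictSoft(features)
    # Copy the means into plain lists so the closure below carries no
    # reference to the model (and therefore none to SparkContext).
    means = [g.mu.toArray().tolist() for g in gmm.gaussians]
    # zip() requires both RDDs to have the same partitioning and element
    # order; that is expected here because memberships is derived
    # element-wise from features.
    return points.zip(memberships).map(lambda pair: (pair[0], pair[1], means))
```

Inside the training loop this would be called as
soft_predictions_with_means(rdd, gmm) right after GaussianMixture.train, and
the result can feed the next iteration without any closure over gmm.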

Yanbo

2016-01-04 22:16 GMT+08:00 Tomasz Fruboes <To...@ncbj.gov.pl>:

> Hi Yanbo,
>
>  thanks for the info. Is this likely to change in the (near :) ) future?
> Being able to call this function only on local data (i.e. not inside an
> RDD) seems to be a rather serious limitation.
>
>  cheers,
>   Tomasz

Re: Problem embedding GaussianMixtureModel in a closure

Posted by Tomasz Fruboes <To...@ncbj.gov.pl>.
Hi Yanbo,

  thanks for the info. Is this likely to change in the (near :) ) future?
Being able to call this function only on local data (i.e. not inside an
RDD) seems to be a rather serious limitation.

  cheers,
   Tomasz


