Posted to user@spark.apache.org by jamborta <ja...@gmail.com> on 2014/11/04 11:30:51 UTC

pass unique ID to mllib algorithms pyspark

Hi all, 

There are a few algorithms in PySpark whose prediction step is
implemented in Scala (e.g. ALS, decision trees), which makes it hard to
customise the prediction methods.

I think it is a very common scenario that a user wants to generate
predictions for a dataset so that each predicted value is identifiable
(e.g. has a unique ID attached to it). This is not possible in the
current implementation: the predict functions take an RDD of feature
vectors and return the predicted values, and, I believe, the order of
the output is not guaranteed, so there is no way to join the predictions
back to the original data they were generated from.
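
To make it concrete, this is roughly the pattern I am after (a
hypothetical sketch; the final zip is exactly the step I am not sure is
safe to rely on):

    from pyspark import SparkContext
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.tree import DecisionTree

    sc = SparkContext(appName="predict-with-ids")

    # (id, LabeledPoint) pairs -- the id is what I would like to keep
    # attached to each prediction
    data = sc.parallelize([
        (1, LabeledPoint(0.0, [0.0, 1.0])),
        (2, LabeledPoint(1.0, [1.0, 0.0])),
        (3, LabeledPoint(1.0, [1.0, 1.0])),
    ])

    model = DecisionTree.trainClassifier(data.values(), numClasses=2,
                                         categoricalFeaturesInfo={})

    # predict() only takes the feature vectors and only returns the
    # predictions, so the ids are dropped; zipping them back assumes the
    # output comes back in the same order as the input
    predictions = model.predict(data.values().map(lambda lp: lp.features))
    ids_and_predictions = data.keys().zip(predictions)
    print(ids_and_predictions.collect())

(For ALS the situation is a bit better, since predictAll takes (user,
product) pairs and returns Ratings that still contain the user and
product, but I could not find anything similar for the other models.)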

Is there a way around this at the moment? 

thanks, 



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pass-unique-ID-to-mllib-algorithms-pyspark-tp18051.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: pass unique ID to mllib algorithms pyspark

Posted by Tamas Jambor <ja...@gmail.com>.
Hi Xiangrui,

Thanks for the reply. Is this still due to be released in 1.2
(SPARK-3530 is still open)?

Thanks,

On Wed, Nov 5, 2014 at 3:21 AM, Xiangrui Meng <me...@gmail.com> wrote:
> The proposed new set of APIs (SPARK-3573, SPARK-3530) will address
> this issue: we "carry over" extra columns through training and
> prediction and then leverage Spark SQL's execution plan optimization to
> decide which columns are actually needed. For the current set of APIs,
> we can add `predictOnValues` to models, which carries the input keys
> through prediction. StreamingKMeans and StreamingLinearRegression
> implement this method.
> -Xiangrui
>
> On Tue, Nov 4, 2014 at 2:30 AM, jamborta <ja...@gmail.com> wrote:
>> Hi all,
>>
>> There are a few algorithms in PySpark whose prediction step is
>> implemented in Scala (e.g. ALS, decision trees), which makes it hard to
>> customise the prediction methods.
>>
>> I think it is a very common scenario that a user wants to generate
>> predictions for a dataset so that each predicted value is identifiable
>> (e.g. has a unique ID attached to it). This is not possible in the
>> current implementation: the predict functions take an RDD of feature
>> vectors and return the predicted values, and, I believe, the order of
>> the output is not guaranteed, so there is no way to join the predictions
>> back to the original data they were generated from.
>>
>> Is there a way around this at the moment?
>>
>> thanks,

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: pass unique ID to mllib algorithms pyspark

Posted by Xiangrui Meng <me...@gmail.com>.
The proposed new set of APIs (SPARK-3573, SPARK-3530) will address
this issue: we "carry over" extra columns through training and
prediction and then leverage Spark SQL's execution plan optimization to
decide which columns are actually needed.
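
Roughly, the usage we have in mind looks like this (only a sketch of the
intended behavior; the exact classes, entry points, and column names
shown here are illustrative and not all of them exist yet):

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.appName("carry-over-columns").getOrCreate()

    # "id" is an extra column that is simply carried along; only
    # "features" and "label" are used for fitting
    df = spark.createDataFrame(
        [(1, Vectors.dense([0.0, 1.0]), 0.0),
         (2, Vectors.dense([1.0, 0.0]), 1.0)],
        ["id", "features", "label"])

    model = LogisticRegression(maxIter=10).fit(df)

    # transform() appends a "prediction" column and keeps the existing
    # columns, so every prediction stays attached to its id
    model.transform(df).select("id", "prediction").show()

Because the result is a DataFrame, the SQL optimizer can also prune any
carried-over columns that are never used downstream.
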
For the current set of APIs, we can add `predictOnValues` to models,
which carries the input keys through prediction. StreamingKMeans and
StreamingLinearRegression implement this method.
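
For example, with StreamingKMeans the keys are passed through prediction
untouched (a minimal sketch; it uses the Python streaming wrapper, which
only appeared in a later release, and queueStream merely to fake an
input stream):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.mllib.clustering import StreamingKMeans
    from pyspark.mllib.linalg import Vectors

    sc = SparkContext(appName="streaming-kmeans-keys")
    ssc = StreamingContext(sc, batchDuration=1)

    # unkeyed vectors for training
    training = ssc.queueStream(
        [sc.parallelize([Vectors.dense([0.0]), Vectors.dense([1.0])])])
    # (key, vector) pairs for prediction -- the key can be any identifier
    test = ssc.queueStream(
        [sc.parallelize([("row-1", Vectors.dense([0.1])),
                         ("row-2", Vectors.dense([0.9]))])])

    model = StreamingKMeans(k=2, decayFactor=1.0).setRandomCenters(1, 1.0, 0)
    model.trainOn(training)

    # predictOnValues applies the model to the values only and keeps the
    # keys, so the output stream contains (key, clusterIndex) pairs
    model.predictOnValues(test).pprint()

    ssc.start()
    ssc.awaitTerminationOrTimeout(10)
    ssc.stop(stopSparkContext=True, stopGraceFully=False)
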
-Xiangrui

On Tue, Nov 4, 2014 at 2:30 AM, jamborta <ja...@gmail.com> wrote:
> Hi all,
>
> There are a few algorithms in PySpark whose prediction step is
> implemented in Scala (e.g. ALS, decision trees), which makes it hard to
> customise the prediction methods.
>
> I think it is a very common scenario that a user wants to generate
> predictions for a dataset so that each predicted value is identifiable
> (e.g. has a unique ID attached to it). This is not possible in the
> current implementation: the predict functions take an RDD of feature
> vectors and return the predicted values, and, I believe, the order of
> the output is not guaranteed, so there is no way to join the predictions
> back to the original data they were generated from.
>
> Is there a way around this at the moment?
>
> thanks,

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org