Posted to user@spark.apache.org by Wen Phan <we...@mac.com> on 2014/07/11 16:07:26 UTC

Categorical Features for K-Means Clustering

Hi Folks,

Does anyone have experience or recommendations on incorporating categorical features (attributes) into k-means clustering in Spark? In other words, I want to cluster on a set of attributes that includes categorical variables.

I know I could probably implement some custom code to parse the data and calculate my own similarity function, but I wanted to reach out before doing so. I’d also prefer to take advantage of the k-means|| (parallel) initialization feature of the model in MLlib, so an MLlib-based implementation would be preferred.

Thanks in advance.

Best,

-Wen

Re: Categorical Features for K-Means Clustering

Posted by Aris <ar...@gmail.com>.

Yeah, another vote here for what's called one-hot encoding: convert the
single categorical feature into N columns, where N is the number of
distinct values of that feature. In each row, the column matching that
row's value is set to one and the other N - 1 columns are set to zero.
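
A minimal sketch of the encoding described above, in plain Python rather than Spark. The function name one_hot_encode and the sample "colors" column are hypothetical, chosen only for illustration; this is not an MLlib utility.

```python
def one_hot_encode(values):
    """Turn a list of categorical values into rows of N zero/one columns."""
    # Sort the distinct values so the column order is deterministic.
    categories = sorted(set(values))
    column_of = {category: i for i, category in enumerate(categories)}

    encoded = []
    for value in values:
        row = [0.0] * len(categories)   # N columns, all zero...
        row[column_of[value]] = 1.0     # ...except the one for this value
        encoded.append(row)
    return categories, encoded


# Hypothetical categorical column with three distinct values (N = 3).
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)

for color, row in zip(colors, encoded):
    print(color, row)
```

Each encoded row could then be appended to the row's numeric features to form the vector passed to k-means.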

On Tue, Sep 16, 2014 at 2:16 PM, Sean Owen <so...@cloudera.com> wrote:

> I think it's on the table but not yet merged?
> https://issues.apache.org/jira/browse/SPARK-1216
>
> On Tue, Sep 16, 2014 at 10:04 PM, st553 <st...@gmail.com> wrote:
> > Does MLlib provide utility functions to do this kind of encoding?
> >
> > --
> > View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Categorical-Features-for-K-Means-Clustering-tp9416p14394.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> > For additional commands, e-mail: user-help@spark.apache.org

Re: Categorical Features for K-Means Clustering

Posted by Sean Owen <so...@cloudera.com>.

I think it's on the table but not yet merged?
https://issues.apache.org/jira/browse/SPARK-1216

On Tue, Sep 16, 2014 at 10:04 PM, st553 <st...@gmail.com> wrote:
> Does MLlib provide utility functions to do this kind of encoding?

Re: Categorical Features for K-Means Clustering

Posted by st553 <st...@gmail.com>.

Does MLlib provide utility functions to do this kind of encoding?



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Categorical-Features-for-K-Means-Clustering-tp9416p14394.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Categorical Features for K-Means Clustering

Posted by Wen Phan <we...@mac.com>.

I see. So it's basically like the dummy variables used in regression. Thanks, Sean.

On Jul 11, 2014, at 10:11 AM, Sean Owen <so...@cloudera.com> wrote:

> Since you can't define your own distance function, you will need to
> convert these to numeric dimensions. 1-of-n encoding can work OK,
> depending on your use case. So a dimension that takes on 3 categorical
> values becomes 3 dimensions, all of which are 0 except one that has
> value 1.

Re: Categorical Features for K-Means Clustering

Posted by Sean Owen <so...@cloudera.com>.

Since you can't define your own distance function, you will need to
convert these to numeric dimensions. 1-of-n encoding can work OK,
depending on your use case. So a dimension that takes on 3 categorical
values becomes 3 dimensions, all of which are 0 except one that has
value 1.
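
To illustrate why this "can work OK, depending on your use case", here is a rough sketch in plain Python of how 1-of-n dimensions behave under the squared Euclidean distance that k-means minimizes. The rows, the "age" column, and the helper squared_distance are all hypothetical and exist only for this example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical rows laid out as [age, is_blue, is_green, is_red].
row_a = [30.0, 1.0, 0.0, 0.0]
row_b = [30.0, 0.0, 1.0, 0.0]   # same age as row_a, different category
row_c = [45.0, 1.0, 0.0, 0.0]   # same category as row_a, different age

category_gap = squared_distance(row_a, row_b)  # 1 + 1 = 2.0
numeric_gap = squared_distance(row_a, row_c)   # 15 ** 2 = 225.0

print(category_gap, numeric_gap)
```

Any two distinct categories end up the same distance apart, and a numeric column on a larger scale can outweigh the categorical columns, so rescaling the numeric features first may be worth considering.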
