Posted to user@spark.apache.org by Matt Hicks <ma...@outr.com> on 2018/03/01 19:53:47 UTC

K Means Clustering Explanation

I'm using K Means clustering for a project right now, and it's working very
well.  However, I'd like to determine from the clusters which distinguishing
features define each cluster, so I can explain the "reasons" data fits into a
specific cluster.
Is there a proper way to do this in Spark ML?

Re: K Means Clustering Explanation

Posted by Alessandro Solimando <al...@gmail.com>.
Hi Matt,
unfortunately I have no code pointer at hand.

I'll sketch how to accomplish this via the API; it should at least help you
get started.

1) ETL + vectorization (I assume your feature vector is named "features")

2) You run a clustering algorithm (say KMeans:
https://spark.apache.org/docs/2.2.0/ml-clustering.html); calling "fit" gives
you a model, and the model's "transform" method adds an extra column named
"prediction", so that each row in the original dataframe has a cluster id
associated with it (you can control the name of the prediction column, as
shown in the example; I assume the default)

3) You run a DT (https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#decision-tree-classifier), and specify

.setLabelCol("prediction")
.setFeaturesCol("features")


so that the output of KMeans (the cluster id) is used as the class label by
the classification algorithm, which again uses the feature vector (still
stored in the "features" column).
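Spark code aside, steps 2 and 3 can be sketched end to end with scikit-learn standing in for the Spark ML API (toy data and parameters invented for illustration, not the poster's actual pipeline):

```python
# Cluster-then-classify sketch: KMeans assigns cluster ids, then a
# decision tree learns to predict those ids from the same features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
# Toy "features" matrix: two well-separated blobs in 2D.
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

# Step 2: cluster; labels_ plays the role of the "prediction" column.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = kmeans.labels_

# Step 3: fit an interpretable classifier with cluster ids as labels.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, cluster_ids)

# On separable blobs the tree reproduces the clustering essentially
# perfectly, so its splits "explain" cluster membership.
print(tree.score(X, cluster_ids))
```

The same shape carries over to Spark ML: the classifier never sees anything but the feature vector and the cluster id produced by the clustering step.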

4) For a start, you can visualize the decision tree with the "toDebugString"
method (note that features are referenced by index rather than by their
original names; see this for an idea of how to map the indexes back:
https://stackoverflow.com/questions/36122559/how-to-map-variable-names-to-features-after-pipeline)

The easiest insight you can get from the classifier is "feature importance",
which gives you an approximate idea of which features are most relevant to
the classification.
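As a sketch of reading importances (again with scikit-learn standing in for Spark ML, and with made-up column names), pairing the fitted tree's feature_importances_ with the original names gives a quick ranking:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Toy data: only the "income" column actually separates the clusters.
names = ["age", "income", "visits"]
X = rng.normal(0.0, 1.0, size=(200, 3))
X[100:, 1] += 10.0  # shift "income" for half the rows

cluster_ids = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, cluster_ids)

# Pair each feature name with its importance and rank them;
# "income" should dominate since it alone drives the clustering.
ranked = sorted(zip(names, tree.feature_importances_), key=lambda p: -p[1])
print(ranked)
```

In Spark ML the analogous field is the model's featureImportances vector, indexed in the same order as the assembled feature vector.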

Otherwise you can inspect the model programmatically or manually, but you
have to define precisely what you want to look at (coverage, precision,
recall, etc.) and rank the tree leaves accordingly (using the impurity stats
of each node, for instance).

Hth,
Alessandro

On 2 March 2018 at 15:42, Matt Hicks <ma...@outr.com> wrote:

> Thanks Alessandro and Christoph.  I appreciate the feedback, but I'm still
> having issues determining how to actually accomplish this with the API.
>
> Can anyone point me to an example in code showing how to accomplish this?
>
>
>
> On Fri, Mar 2, 2018 2:37 AM, Alessandro Solimando
> alessandro.solimando@gmail.com wrote:
>
>> Hi Matt,
>> similarly to what Christoph does, I first derive the cluster id for the
>> elements of my original dataset, and then I use a classification algorithm
>> (cluster ids being the classes here).
>>
>> For this method to be useful you need a "human-readable" model,
>> tree-based models are generally a good choice (e.g., Decision Tree).
>>
>> However, since those models tend to be verbose, you still need a way to
>> summarize them to facilitate readability (there must be some literature on
>> this topic, although I have no pointers to provide).
>>
>> Hth,
>> Alessandro
>>
>>
>>
>>
>>
>> On 1 March 2018 at 21:59, Christoph Brücke <ca...@gmail.com> wrote:
>>
>> Hi Matt,
>>
>> I see. You could use the trained model to predict the cluster id for each
>> training point. Now you should be able to create a dataset with your
>> original input data and the associated cluster id for each data point in
>> the input data. Now you can group this dataset by cluster id and aggregate
>> over the original 5 features. E.g., get the mean for numerical data or the
>> value that occurs the most for categorical data.
>>
>> The exact aggregation is use-case dependent.
>>
>> I hope this helps,
>> Christoph
>>
>> On 01.03.2018 21:40, "Matt Hicks" <ma...@outr.com> wrote:
>>
>> Thanks for the response Christoph.
>>
>> I'm converting large amounts of data into clustering training data, and I'm
>> just having a hard time reasoning about reversing the clusters (in code)
>> back to the original format to properly understand the dominant values in
>> each cluster.
>>
>> For example, if I have five fields of data and I've trained ten clusters
>> of data I'd like to output the five fields of data as represented by each
>> of the ten clusters.
>>
>>
>>
>> On Thu, Mar 1, 2018 2:36 PM, Christoph Brücke carabolic@gmail.com wrote:
>>
>> Hi Matt,
>>
>> the clusters are defined by their centroids / cluster centers. All the
>> points belonging to a certain cluster are closer to its centroid than to
>> the centroids of any other cluster.
>>
>> What I typically do is convert the cluster centers back to the original
>> input format, or if that is not possible, use the point nearest to the
>> cluster center as a representation of the whole cluster.
>>
>> Can you be a little bit more specific about your use-case?
>>
>> Best,
>> Christoph
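The nearest-point fallback described above can be sketched in plain Python (toy points and centers, invented for illustration):

```python
# For each cluster center, pick the training point closest to it
# as a human-readable representative of that cluster.
points = [(0.0, 0.0), (1.0, 1.0), (9.0, 9.0), (10.0, 10.0)]
centers = [(0.3, 0.2), (9.7, 9.9)]

def sq_dist(a, b):
    # Squared Euclidean distance; fine for nearest-neighbor ranking.
    return sum((x - y) ** 2 for x, y in zip(a, b))

representatives = [min(points, key=lambda p: sq_dist(p, c))
                   for c in centers]
print(representatives)  # one representative point per cluster
```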
>>
>> On 01.03.2018 20:53, "Matt Hicks" <ma...@outr.com> wrote:
>>
>> I'm using K Means clustering for a project right now, and it's working
>> very well.  However, I'd like to determine from the clusters which
>> distinguishing features define each cluster, so I can explain the "reasons"
>> data fits into a specific cluster.
>>
>> Is there a proper way to do this in Spark ML?
>>
>>
>>
>>

Re: K Means Clustering Explanation

Posted by Matt Hicks <ma...@outr.com>.
Thanks Alessandro and Christoph.  I appreciate the feedback, but I'm still
having issues determining how to actually accomplish this with the API.
Can anyone point me to an example in code showing how to accomplish this?

Re: K Means Clustering Explanation

Posted by Alessandro Solimando <al...@gmail.com>.
Hi Matt,
similarly to what Christoph does, I first derive the cluster id for the
elements of my original dataset, and then I use a classification algorithm
(cluster ids being the classes here).

For this method to be useful you need a "human-readable" model, tree-based
models are generally a good choice (e.g., Decision Tree).

However, since those models tend to be verbose, you still need a way to
summarize them to facilitate readability (there must be some literature on
this topic, although I have no pointers to provide).

Hth,
Alessandro

Re: K Means Clustering Explanation

Posted by Christoph Brücke <ca...@gmail.com>.
Hi Matt,

I see. You could use the trained model to predict the cluster id for each
training point. Now you should be able to create a dataset with your
original input data and the associated cluster id for each data point in
the input data. Now you can group this dataset by cluster id and aggregate
over the original 5 features. E.g., get the mean for numerical data or the
value that occurs the most for categorical data.
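The group-and-aggregate step described above can be sketched in plain Python (column names and values are invented for illustration):

```python
from collections import Counter, defaultdict
from statistics import mean

# Invented rows of (cluster_id, features), standing in for the original
# input data joined with the model's per-row cluster predictions.
rows = [
    (0, {"age": 23, "plan": "basic"}),
    (0, {"age": 27, "plan": "basic"}),
    (0, {"age": 25, "plan": "pro"}),
    (1, {"age": 61, "plan": "pro"}),
    (1, {"age": 58, "plan": "pro"}),
]

by_cluster = defaultdict(list)
for cid, feats in rows:
    by_cluster[cid].append(feats)

def summarize(group):
    # Mean for numeric features, most frequent value for categorical ones.
    out = {}
    for key in group[0]:
        values = [g[key] for g in group]
        if isinstance(values[0], (int, float)):
            out[key] = mean(values)
        else:
            out[key] = Counter(values).most_common(1)[0][0]
    return out

summaries = {cid: summarize(group) for cid, group in by_cluster.items()}
print(summaries)
```

In Spark the same shape is a groupBy("prediction") with avg for numeric columns and a mode-style aggregate for categorical ones.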

The exact aggregation is use-case dependent.

I hope this helps,
Christoph
