Posted to user@spark.apache.org by jatinpreet <ja...@gmail.com> on 2014/12/29 11:55:09 UTC

Clustering text data with MLlib

Hi,

I wish to cluster a set of textual documents into an undefined number of
classes. The clustering algorithm provided in MLlib, i.e. K-means,
requires me to give a pre-defined number of classes.

Is there an algorithm intelligent enough to identify how many classes
should be formed based on the input documents? I want to utilize the
speed and agility of Spark in the process.

Thanks,
Jatin



-----
Novice Big Data Programmer
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Clustering-text-data-with-MLlib-tp20883.html


Re: Clustering text data with MLlib

Posted by Suneel Marthi <su...@yahoo.com.INVALID>.
Here's the Streaming KMeans from Spark 1.2:
http://spark.apache.org/docs/latest/mllib-clustering.html#examples-1
Streaming KMeans still needs an initial 'k' to be specified; it then
progresses to come up with an optimal 'k', IIRC.
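
For reference, a minimal sketch of driving it (Spark 1.2 MLlib). It
assumes the documents have already been turned into fixed-size feature
vectors; the input path and the vector dimension below are made-up
placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.mllib.clustering.StreamingKMeans
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("StreamingKMeansSketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Hypothetical input: one space-separated numeric feature vector
    // per line, e.g. the output of an upstream TF-IDF step.
    val training = ssc.textFileStream("hdfs:///clustering/training")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

    val model = new StreamingKMeans()
      .setK(10)                   // an initial 'k' is still required
      .setDecayFactor(1.0)        // how quickly old data is forgotten
      .setRandomCenters(100, 0.0) // 100 = assumed feature dimension

    model.trainOn(training)
    ssc.start()
    ssc.awaitTermination()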


Re: Clustering text data with MLlib

Posted by Sean Owen <so...@cloudera.com>.
You can try several values of k, apply some evaluation metric to the
clustering, and then use that to decide what k is best, or at least
pretty good. If it's a completely unsupervised problem, the metrics
you can use tend to be some function of the inter-cluster and
intra-cluster distances (good clustering means points are near to
things in their own cluster and far from things in other clusters).
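
In Spark, that sweep over k is only a few lines. A rough sketch,
assuming 'data' is an RDD[Vector] of document vectors (e.g. TF-IDF),
using KMeansModel.computeCost, which returns the within-cluster sum of
squared distances:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Train at several values of k and record the within-set sum of
    // squared errors; look for the "elbow" where adding more clusters
    // stops paying off.
    def sweepK(data: RDD[Vector], ks: Seq[Int]): Seq[(Int, Double)] = {
      data.cache()
      ks.map { k =>
        val model = KMeans.train(data, k, 20) // 20 iterations
        (k, model.computeCost(data))
      }
    }

    // e.g. sweepK(tfidfVectors, Seq(2, 5, 10, 20, 50)).foreach(println)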

If it's a supervised problem, you can bring in metrics like purity or
mutual information, but I don't think that's the case here. You would
have to implement these metrics yourself.
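
Purity, for instance, is short to write once you have pairs of
(assigned cluster, true label). A plain sketch, not anything that
exists in MLlib:

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    // Fraction of points falling in their cluster's majority class.
    def purity(pairs: RDD[(Int, String)]): Double = {
      val n = pairs.count().toDouble
      pairs.map { case (cluster, label) => ((cluster, label), 1L) }
        .reduceByKey(_ + _)                    // count per (cluster, label)
        .map { case ((cluster, _), count) => (cluster, count) }
        .reduceByKey((a, b) => math.max(a, b)) // majority count per cluster
        .values
        .sum() / n
    }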

You can also consider clustering algorithms that do not depend on k,
like, say, DBSCAN, although it has its own hyperparameters to pick.
Again, you'd have to implement it yourself.

What you describe sounds like topic modeling using LDA. This still
requires you to pick a number of topics, but it lets documents belong
to several topics, so maybe that's more like what you want. This isn't
in Spark per se, but there is some work in progress on it
(https://issues.apache.org/jira/browse/SPARK-1405), and Sandy has
written up some text on doing this in Spark.
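
A sketch of what that looks like with the LDA implementation that later
landed in MLlib (org.apache.spark.mllib.clustering.LDA, Spark 1.3); the
per-document term-count vectors are assumed to come from an upstream
tokenize-and-count step:

    import org.apache.spark.mllib.clustering.LDA
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // termCounts: assumed RDD[Vector] of per-document term counts.
    val corpus: RDD[(Long, Vector)] =
      termCounts.zipWithIndex.map { case (counts, id) => (id, counts) }

    val ldaModel = new LDA().setK(10).run(corpus)

    // Unlike k-means, each document gets a mixture over the 10 topics
    // rather than a single hard cluster assignment.
    println(s"${ldaModel.k} topics, ${ldaModel.vocabSize}-term vocabulary")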

Finally, there is the hierarchical Dirichlet process (HDP), which
allows the number of topics to be learned dynamically. This is
relatively advanced.

Finally finally, maybe someone can remind me of the streaming k-means
variant that tries to pick k dynamically too. I am not finding what
I'm thinking of, but I believe it exists.



Re: Clustering text data with MLlib

Posted by xhudik <xh...@gmail.com>.
K-means really does require the number of clusters to be identified in
advance. There are multiple algorithms (X-means, ART, ...) which do not
need this information. Unfortunately, none of them is implemented in
MLlib at the moment (you can give a hand and help the community).

Anyway, it seems to me you will not be satisfied with those algorithms
(X-means, ART, ...) either. As I understand it, what you want to
achieve is a precise number of clusters. Notice that whenever you
change the input parameters (random seed, ...), the number of clusters
might be different. Clustering is a great tool, but it won't give you
one true answer (one number).


regards, Tomas



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Clustering-text-data-with-MLlib-tp20883p20899.html


Re: Clustering text data with MLlib

Posted by Paco Nathan <ce...@gmail.com>.
Jatin,

One approach to determining K would be to sample the data set and run
PCA. Then evaluate how many of the resulting eigenvalue/eigenvector
pairs to use before you reach diminishing returns on cumulative error.
That number provides a reasonably good value for K to use in KMeans.
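
As a sketch of that, the singular values from MLlib's RowMatrix give
the cumulative curve directly; 'sample' here is an assumed RDD[Vector]
of sampled document vectors (strictly, you would center the data first
for a true PCA):

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.rdd.RDD

    val mat = new RowMatrix(sample) // sample: RDD[Vector]
    val svd = mat.computeSVD(50)    // top 50 singular values

    // Cumulative fraction of variance captured by the first i
    // components; a reasonable K is roughly where this flattens out.
    val variances = svd.s.toArray.map(s => s * s)
    val total = variances.sum
    variances.scanLeft(0.0)(_ + _).tail.map(_ / total).zipWithIndex
      .foreach { case (cum, i) =>
        println(f"components=${i + 1}%3d cumulative=$cum%.3f")
      }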

With recent releases of Spark and MLlib, you don't have to sample; you
could run PCA at scale on the full data set, but that may be overkill
for what you need.

As Sean mentioned, there may be other algorithms that would be more
effective for your use case. LDA is good for topic modeling, but in
practice its results can be noisy unless the pipeline does some
parsing/processing of the text ahead of training.

Word2Vec can be an interesting alternative for topic modeling (it is
also in Spark MLlib), and you may want to take a look at this
tutorial/case study: http://www.yseam.com/blog/WV.html
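
For what it's worth, a minimal sketch of trying MLlib's Word2Vec;
'tokenized' is an assumed RDD[Seq[String]] of tokenized documents, and
the query word is arbitrary:

    import org.apache.spark.mllib.feature.Word2Vec
    import org.apache.spark.rdd.RDD

    // tokenized: RDD[Seq[String]], one tokenized document per element.
    val model = new Word2Vec().setVectorSize(100).fit(tokenized)

    // Nearest neighbours in the learned vector space tend to group
    // by topic.
    model.findSynonyms("spark", 5).foreach { case (word, similarity) =>
      println(f"$word%-15s $similarity%.3f")
    }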

