Posted to user@mahout.apache.org by John White <de...@gmail.com> on 2014/01/16 09:03:27 UTC

Incremental Clustering from Text Data

Hello,
I run seq2sparse with the -wt tfidf option and then execute the k-means
pipeline. When new data arrives at a later date, should I decide which cluster
it belongs to using "Listing 9.4 News clustering using canopy generation and
k-means clustering" in "Mahout in Action", or is there a better, more generic
way (i.e. one that works with other algorithms that take text input)?
Specifically, I need a way to access the dictionary and the tf-idf weights of
the training set when testing incrementally.

Re: Incremental Clustering from Text Data

Posted by John White <de...@gmail.com>.

Hi,

Clarifying my question a little bit:

How can I create a vector from a single text document that conforms to the
schema of the collection of vectors I created earlier with seq2sparse?
I want to use it to find the closest centroid to a text document that is
submitted by a client.

Best
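
To make the question above concrete, here is a minimal sketch of the idea in
plain Python. It is not Mahout code. It assumes the training dictionary
(term -> index) and the document-frequency counts (index -> df) have already
been loaded into ordinary dicts from the seq2sparse output, and that the
centroids are available as sparse {index: weight} maps. All names and the toy
data are hypothetical, and the weighting formula is an assumption that must be
matched to whatever was actually used at training time (tokenization and any
normalization must match as well).

```python
import math
from collections import Counter


def tfidf_vector(tokens, dictionary, doc_freq, num_docs):
    """Build a sparse {index: weight} vector using the TRAINING dictionary and df counts."""
    # Terms that were not seen during training have no index, so they are dropped.
    counts = Counter(t for t in tokens if t in dictionary)
    vec = {}
    for term, tf in counts.items():
        idx = dictionary[term]
        # Assumed weighting: sqrt(tf) * (log(N / (df + 1)) + 1).
        # Replace with the formula used when the training vectors were built.
        idf = math.log(num_docs / (doc_freq[idx] + 1)) + 1.0
        vec[idx] = math.sqrt(tf) * idf
    return vec


def cosine_distance(a, b):
    """Cosine distance between two sparse vectors; 1.0 if either is empty."""
    dot = sum(w * b.get(i, 0.0) for i, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def closest_centroid(vec, centroids):
    """Return the id of the centroid with the smallest cosine distance to vec."""
    return min(centroids, key=lambda cid: cosine_distance(vec, centroids[cid]))


# Hypothetical toy data standing in for the training artifacts.
dictionary = {"apple": 0, "banana": 1, "car": 2, "engine": 3}
doc_freq = {0: 3, 1: 2, 2: 4, 3: 1}
num_docs = 10
centroids = {
    "fruit": {0: 1.0, 1: 0.8},
    "vehicles": {2: 1.0, 3: 0.9},
}

new_doc_tokens = "the car engine stalled".split()
vec = tfidf_vector(new_doc_tokens, dictionary, doc_freq, num_docs)
best = closest_centroid(vec, centroids)
print(best)
```

The point of the sketch is that the new document never contributes to the
dictionary or the df counts: it is projected into the existing feature space,
so its vector is comparable with the centroids from the original run.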

