Posted to user@mahout.apache.org by Lokendra Singh <ls...@gmail.com> on 2011/02/24 18:04:01 UTC

Indexing the clustered data (lucene?)

Hi all,

I am looking for ways to analyze/index the result ('clusteredPoints') of
k-means clustering.
My input test data for clustering are feature vectors extracted from a
dataset of images.

To give an analogy between my problem and Lucene text indexing: if 'n'
feature vectors (image points) of an image fall into one cluster, they are
considered similar, and this can be treated as the same 'word/term' appearing
'n' times in a text document. In the end, I want to generate TF-IDF vectors
for each image.
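The analogy above can be sketched in plain Java, without any indexing
library. This is only an illustration under stated assumptions: the class
ClusterTfIdf and its nested Assignment type are hypothetical names, each
clustered point is assumed to carry the ID of the image it came from, and
IDF is taken as ln(N / df), which is one common variant (other tools apply
smoothed versions).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: treats cluster IDs as "terms" and images as "documents".
class ClusterTfIdf {

    // One clustered feature vector: the image it came from and its cluster.
    static final class Assignment {
        final String imageId;
        final int clusterId;

        Assignment(String imageId, int clusterId) {
            this.imageId = imageId;
            this.clusterId = clusterId;
        }
    }

    // Term frequency: image ID -> (cluster ID -> number of that image's points in the cluster).
    static Map<String, Map<Integer, Integer>> termFrequencies(List<Assignment> points) {
        Map<String, Map<Integer, Integer>> tf = new TreeMap<>();
        for (Assignment p : points) {
            tf.computeIfAbsent(p.imageId, k -> new TreeMap<>())
              .merge(p.clusterId, 1, Integer::sum);
        }
        return tf;
    }

    // TF-IDF weight = tf * ln(N / df), where N is the number of images and
    // df is the number of images with at least one point in the cluster.
    static Map<String, Map<Integer, Double>> tfIdf(Map<String, Map<Integer, Integer>> tf) {
        Map<Integer, Integer> df = new HashMap<>();
        for (Map<Integer, Integer> counts : tf.values()) {
            for (Integer clusterId : counts.keySet()) {
                df.merge(clusterId, 1, Integer::sum);
            }
        }
        int numImages = tf.size();
        Map<String, Map<Integer, Double>> weights = new TreeMap<>();
        for (Map.Entry<String, Map<Integer, Integer>> image : tf.entrySet()) {
            Map<Integer, Double> row = new TreeMap<>();
            for (Map.Entry<Integer, Integer> c : image.getValue().entrySet()) {
                double idf = Math.log((double) numImages / df.get(c.getKey()));
                row.put(c.getKey(), c.getValue() * idf);
            }
            weights.put(image.getKey(), row);
        }
        return weights;
    }

    public static void main(String[] args) {
        // Made-up sample data, purely for illustration.
        List<Assignment> points = List.of(
                new Assignment("img1", 3),
                new Assignment("img1", 3),
                new Assignment("img1", 7),
                new Assignment("img2", 3));
        System.out.println(tfIdf(termFrequencies(points)));
    }
}
```

With this weighting, a cluster that occurs in every image gets weight 0,
just as a word that occurs in every text document would.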

Is Lucene well suited for this purpose? My idea is to create a 'Document'
object for each image, with its "content" field holding the cluster IDs of
the clusters its feature vectors fall into. However, I am not sure this is a
good approach, since the "content" of the Document objects (the cluster IDs)
would have to be updated continuously while iterating through the sequence
file inside the 'clusteredPoints' directory.
What other tools or approaches are generally used to analyze/index
non-textual clustered data?
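One way around the continuous-update concern is to group the cluster IDs per
image in a single pass first, and only then build one document per image
from the finished "content" string. The sketch below shows the grouping step
only and does not call Lucene. The class ClusterContentBuilder, the "c"
token prefix, and the availability of an image ID for every point are
assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical accumulator: collects cluster IDs per image while the
// clustered points are read one by one, in any order.
class ClusterContentBuilder {

    private final Map<String, StringBuilder> contentByImage = new LinkedHashMap<>();

    // Record that one feature vector of the given image fell into the given cluster.
    void add(String imageId, int clusterId) {
        StringBuilder sb = contentByImage.computeIfAbsent(imageId, k -> new StringBuilder());
        if (sb.length() > 0) {
            sb.append(' ');
        }
        // Prefix the ID so each token reads as a word, e.g. "c42".
        sb.append('c').append(clusterId);
    }

    // Finished "content" strings, one per image, in first-seen order.
    Map<String, String> contents() {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> e : contentByImage.entrySet()) {
            out.put(e.getKey(), e.getValue().toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample data, purely for illustration.
        ClusterContentBuilder builder = new ClusterContentBuilder();
        builder.add("img1", 3);
        builder.add("img2", 3);
        builder.add("img1", 7);
        builder.add("img1", 3);
        builder.contents().forEach((image, content) ->
                System.out.println(image + " -> " + content));
    }
}
```

Each resulting string repeats a token once per feature vector in that
cluster, so a whitespace-based analyzer would see the repeats as term
frequency. Since this keeps all strings in memory, it only suits datasets
whose per-image token lists fit there.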

PS: Until now, I have been using native data structures to analyse the
clustering results, but I am looking for scalable tools to handle this.

Regards
Lokendra

Re: Indexing the clustered data (lucene?)

Posted by sharath jagannath <sh...@gmail.com>.

Try Zoie <http://sna-projects.com/zoie/>. They claim it is real-time; I am
playing around with it myself.

Thanks,
Sharath
