Posted to user@mahout.apache.org by syed kather <in...@gmail.com> on 2011/12/05 14:09:02 UTC

Clustering on 3D in Mahout

Team,

     Is it possible to cluster in 3D?


I am trying the case given below.

1.  I have a Solr index with three fields (SEQID, BODY (content of the
text file), FILEPATH).

Now I need to cluster this. Please help me: is there a way to do this?

            Thanks and Regards,
        S SYED ABDUL KATHER

Re: Clustering on 3D in Mahout

Posted by Fernando Fernández <fe...@gmail.com>.
Sounds weird,

Have you been asked to "cluster documents" or something like that? What is
the goal of that clustering? Using a sequential id and the path of the file
as "variables" makes no sense at all. If you have been asked to "cluster
documents", then what you need is a mathematical representation of the
content. In that case, each possible word in the documents becomes a
dimension (variable), as Christoph said, so this is much more than 3
dimensions. You can ask Lucene to output TF-IDF vectors from the content of
each document (you should search for some info about what TF-IDF is) in a
format that Mahout can read.

If you have been literally asked to "cluster with 3 dimensions: seqid,
body, and filepath", then maybe "clustering" is not what he/she actually
means...
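
To make the "each word becomes a dimension" point concrete, here is a minimal sketch in plain Python. It is not Lucene's or Mahout's API, and it uses one common TF-IDF variant (raw term count times log of inverse document frequency), which may differ from the weighting Lucene applies. The function name `tfidf_vectors` and the sample documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) with one TF-IDF weight per word per document."""
    tokenized = [doc.lower().split() for doc in docs]
    # Every distinct word in the collection is one dimension.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Raw term frequency times log inverse document frequency (assumed variant).
        vectors.append([counts[w] * math.log(n_docs / df[w]) for w in vocabulary])
    return vocabulary, vectors

# Hypothetical document bodies standing in for the BODY field.
docs = [
    "mahout clusters documents",
    "solr indexes documents",
    "mahout reads vectors",
]
vocabulary, vectors = tfidf_vectors(docs)
print(len(vocabulary), "dimensions:", vocabulary)
```

Even three very short documents already produce more than three dimensions, and a word that appears in fewer documents gets a larger weight than one that appears in many.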


Re: Clustering on 3D in Mahout

Posted by Christoph Brücke <ch...@campus.tu-berlin.de>.
Hi Syed,

I have never used Lucene or Solr myself, so I can only rephrase what's in the Mahout wiki. Take a look at how to convert a Lucene (== Solr) index to a Mahout-compatible vector format [1]. Also have a look at the JavaDocs [2], especially KMeansDriver and CosineDistanceMeasure.

To use your Solr index for clustering (I assume k-means clustering; other clustering algorithms should work the same way), create a sequence file from your index as described in the wiki [1]. After that, create a new job using KMeansDriver by calling the constructor with the input directory, output directory, ... and, most importantly, the CosineDistanceMeasure as a parameter. That's the hard part; for the actual clustering you just call run on your job, then sit back and relax.

To make it clear, I assumed you are using k-means clustering and the body text of your Solr index. The described process should be applicable to the other clustering algorithms as well.

Basically, you just need your input data (the data to be clustered) as an iterable collection of vectors, and a distance metric that can be used on that data. In your case this is the body text of the index and cosine distance. And again, for your stated use case you don't really want just three dimensions: every word in your body text represents a new dimension, so you most likely have many more than three dimensions.
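
As an illustration of cosine distance on word vectors, here is a small sketch in plain Python. It is not Mahout's implementation: the sparse `{term: weight}` dictionary layout, the function name `cosine_distance`, and the sample weights are assumptions made for the example.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two sparse vectors given as {term: weight}."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # convention chosen here for empty vectors
    return 1.0 - dot / (norm_a * norm_b)

doc_a = {"mahout": 2.0, "cluster": 1.0}
doc_b = {"mahout": 4.0, "cluster": 2.0}  # same direction as doc_a, twice as long
doc_c = {"solr": 3.0}                    # shares no words with doc_a

print(cosine_distance(doc_a, doc_b))  # close to 0: same word proportions
print(cosine_distance(doc_a, doc_c))  # 1.0: no words in common
```

Because the measure depends only on the direction of the vectors, a long document and a short one with the same word proportions end up close together, which is usually what you want for text.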

If I totally missed your point please speak out.


So long,
Christoph

[1] https://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text#CreatingVectorsfromText-FromLucene
[2] https://builds.apache.org/job/Mahout-Quality/javadoc/



Christoph Brücke
christoph.bruecke@campus.tu-berlin.de




Re: Clustering on 3D in Mahout

Posted by syed kather <in...@gmail.com>.
Thanks, Christoph.
Can you give some sample info on clustering in 3D that I can understand?

Please help me, so that I can learn new things. Is it possible using
Solr? If so, how can I do that?


            Thanks and Regards,
        S SYED ABDUL KATHER
                9731841519



Re: Clustering on 3D in Mahout

Posted by Christoph Brücke <ch...@campus.tu-berlin.de>.
Hi Syed,

To answer your first question: YES, Mahout is totally capable of clustering in three dimensions. However, as far as my knowledge of KMeansClustering goes, each feature (dimension) has to be of the same type, meaning there has to be one distance metric that can express the distance between every two points. That said, I don't think you can define a metric that uses seqid, text and text (filepath) as coordinates. But I think you could just use the body of your index and calculate something like cosine distance to cluster your index entries, as the seqid is probably unique to every entry and the file path is not really relevant (at least I can't come up with any suitable use case).

TL;DR: Yes, you can cluster in multiple dimensions as long as you can define a distance between every pair of points. You are probably better off using just the body text of your Solr index.
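
For comparison, this is roughly what clustering in three genuinely numeric dimensions looks like. The sketch below is a bare-bones k-means in plain Python with Euclidean distance, not Mahout's KMeansDriver; the function name `kmeans`, the sample points, and the starting centers are invented for the example, and `math.dist` needs Python 3.8 or later.

```python
import math

def kmeans(points, centers, iterations=10):
    """Plain k-means with Euclidean distance; `centers` are the starting centroids."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its old centroid).
        centers = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else center
            for members, center in zip(clusters, centers)
        ]
    return centers, clusters

# Two well-separated groups of hypothetical 3D points.
points = [
    (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
    (10.0, 10.0, 10.0), (11.0, 10.0, 10.0), (10.0, 11.0, 10.0),
]
centers, clusters = kmeans(points, [(0.0, 0.0, 0.0), (10.0, 10.0, 10.0)])
print(centers)
```

This works because all three coordinates are numbers on comparable scales, so one distance function covers every pair of points. A row made of (seqid, body text, file path) has no such common distance, which is why the body text alone is the better input.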

Regards,
Christoph

