
Posted to user@mahout.apache.org by arindam chakraborty <ar...@gmail.com> on 2012/08/12 19:28:48 UTC

Can clustering answer these questions

I am considering clustering (Canopy or k-means) to build a recommender, but
I have the following uncertainties. If someone could clarify them, it would
be really helpful.

My vectors will be points in 8 dimensions. I expect the clustering
phase to group nearby points into clusters. The output is where I
am stuck: I do not know how to interpret it.


   1. Since the main aim is to recommend similar objects, the assumption is
   that points in the same cluster are similar. Is there a recommender based
   on the clustering output, or would I have to build that logic manually?
   2. Since the output will be a list of vectors per cluster (and they
   will not be unique), how do I identify them? That is, which resulting point
   corresponds to which object, so that I can tell whether objects A, B and C
   are in the same cluster?
   3. For a new object P, is there a way to find its cluster, or will I
   have to rebuild the clusters all over again?
   4. Within a cluster, say I do identify an object P somehow. How can I find
   the closest n points to it? Is there a built-in method, or would I
   have to write my own implementation?
   5. Can I provide a data source such as a DB to the clustering job, so that
   it can fit changed rows into their respective clusters? Or would I
   have to rebuild the clusters?
   6. Can an object O be added to a cluster in real time? Can I find
   its closest points within the cluster in real time? [Similar to points 3
   and 4.]
   7. Do the clusters need to be rebuilt on every addition to my source
   data, or can the delta be identified and the clusters readjusted? Is there
   a refresh() method, as there is for Recommenders?


If you can answer one or more questions, it would be very useful.

Re: Can clustering answer these questions

Posted by Julian Ortega <jo...@gmail.com>.

For questions 1 and 2 you might want to look at
https://cwiki.apache.org/MAHOUT/quick-tour-of-text-analysis-using-the-mahout-command-line.html,
specifically the rowid and rowsimilarity jobs.
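
A rough in-memory sketch of the idea behind question 1, in case it helps: if
every row (object) is kept under its own name and rows are scored against each
other, then "recommend something similar to A" is just "find the best-scoring
other row". This is only an illustration, not what the Mahout jobs do
internally. The class and method names are made up, cosine similarity is an
assumed choice of measure, and the real jobs run on Hadoop over sequence files
rather than over a Map in memory.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Mahout.
final class RowSimilaritySketch {

    // Cosine similarity of two equal-length vectors; 0 if either is all zeros.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Name of the row most similar to the named row, or null if there is no other row.
    static String mostSimilar(Map<String, double[]> rows, String name) {
        double[] target = rows.get(name);
        if (target == null) {
            throw new IllegalArgumentException("unknown row: " + name);
        }
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> entry : rows.entrySet()) {
            if (entry.getKey().equals(name)) {
                continue; // do not compare a row with itself
            }
            double score = cosine(target, entry.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Two dimensions for brevity; the vectors in the question would have 8.
        Map<String, double[]> rows = new LinkedHashMap<>();
        rows.put("A", new double[] {1.0, 0.0});
        rows.put("B", new double[] {0.9, 0.1});
        rows.put("C", new double[] {0.0, 1.0});
        System.out.println("most similar to A: " + mostSimilar(rows, "A"));
    }
}
```

Because the rows stay keyed by name, the result can be read directly as an
object rather than as an anonymous vector, which is the concern in question 2.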

Re: Can clustering answer these questions

Posted by arindam chakraborty <ar...@gmail.com>.

Hey Paritosh,

Thanks for the help. That does give a clearer picture.

On Mon, Aug 13, 2012 at 7:57 AM, Paritosh Ranjan <pr...@xebia.com> wrote:

> I can try to answer few :
>
> 1) I don't know.
>
> 2) Use org.apache.mahout.math.NamedVector to identify clusters.
>
> 3) Yes, new points can be identified without clustering all over again. See
> org.apache.mahout.clustering.classify.ClusterClassifier
> org.apache.mahout.clustering.iterator.ClusterIterator
> org.apache.mahout.clustering.classify.ClusterClassificationDriver
>
> 4) I don't think there is any built in implementation for this.
>
> 5) AFAIK, clustering algorithms take sequence files as input, there is no
> support for DB.
>
> 6) Yes, it is possible. Though you will have to write some code. See
> answer to question 3.
>
> 7) No, there is no refresh method sort of thing.
>
> HTH

Re: Can clustering answer these questions

Posted by Paritosh Ranjan <pr...@xebia.com>.

I think this should be the way. However, I am not that comfortable
with the dictionary code.
Maybe someone more familiar with the dictionary can help here.

On 13-08-2012 16:31, Vikram wrote:
> Regarding your answer to point 3, I assume we need to pass in the new
> vectors and the old clusters so that classifyCluster can put the new vectors
> into the existing clusters. I am curious how the new vectors can be generated
> using the existing dictionary. Say my dictionary was prepared from x
> documents, and now there are y new documents containing a few terms that are
> not in the dictionary. Would you merge the dictionary by adding the new terms
> and then create the vectors? Is there any utility to do this?
>
> Thanks,
> Vikram
>
>

Re: Can clustering answer these questions

Posted by Vikram <nv...@hotmail.com>.

Regarding your answer to point 3, I assume we need to pass in the new
vectors and the old clusters so that classifyCluster can put the new vectors
into the existing clusters. I am curious how the new vectors can be generated
using the existing dictionary. Say my dictionary was prepared from x
documents, and now there are y new documents containing a few terms that are
not in the dictionary. Would you merge the dictionary by adding the new terms
and then create the vectors? Is there any utility to do this?

Thanks,
Vikram
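
To make the two options in this question concrete, here is a small sketch of
both: keep the dictionary fixed and drop unseen terms, or append unseen terms
to the end of the dictionary. It is not Mahout's dictionary code and does not
answer whether Mahout ships a utility for this. The class name, the method
names and the use of raw term counts (no TF-IDF weighting) are all assumptions
made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; this does not model Mahout's dictionary files or vectorizer.
final class DictionaryVectorizer {
    private final Map<String, Integer> termToIndex = new LinkedHashMap<>();

    DictionaryVectorizer(String... knownTerms) {
        for (String term : knownTerms) {
            // The next free index is the current dictionary size.
            termToIndex.putIfAbsent(term, termToIndex.size());
        }
    }

    int size() {
        return termToIndex.size();
    }

    // Option 1: the dictionary stays fixed and terms it does not contain are dropped,
    // so new vectors have the same dimensions as the vectors the clusters were built from.
    double[] vectorizeFixed(String[] tokens) {
        double[] counts = new double[termToIndex.size()];
        for (String token : tokens) {
            Integer index = termToIndex.get(token);
            if (index != null) {
                counts[index] += 1.0;
            }
        }
        return counts;
    }

    // Option 2: unseen terms are appended, so existing indices keep their meaning
    // but the new vectors become longer than the old ones.
    double[] vectorizeExtending(String[] tokens) {
        for (String token : tokens) {
            termToIndex.putIfAbsent(token, termToIndex.size());
        }
        return vectorizeFixed(tokens);
    }

    public static void main(String[] args) {
        DictionaryVectorizer vectorizer = new DictionaryVectorizer("apple", "banana");
        String[] newDocument = {"apple", "cherry", "apple"};
        System.out.println(vectorizer.vectorizeFixed(newDocument).length);     // dimensions unchanged
        System.out.println(vectorizer.vectorizeExtending(newDocument).length); // one new dimension
    }
}
```

With option 2, the existing cluster centres would have to be treated as having
zeros in the appended positions before distances to the new vectors make sense.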

Re: Can clustering answer these questions

Posted by Paritosh Ranjan <pr...@xebia.com>.

I can try to answer a few:

1) I don't know.

2) Use org.apache.mahout.math.NamedVector to give each vector a name, so that
points in the clustering output can be traced back to their objects.

3) Yes, new points can be identified without clustering all over again. See
org.apache.mahout.clustering.classify.ClusterClassifier
org.apache.mahout.clustering.iterator.ClusterIterator
org.apache.mahout.clustering.classify.ClusterClassificationDriver

4) I don't think there is any built-in implementation for this.

5) AFAIK, the clustering algorithms take sequence files as input; there is
no support for a DB.

6) Yes, it is possible, though you will have to write some code. See the
answer to question 3.

7) No, there is nothing like a refresh() method.
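
For answers 2, 3, 4 and 6, the "some code" could look roughly like the sketch
below. It is plain Java and deliberately does not call the Mahout classes
listed above, so every name in it (ClusterLookup, NamedPoint, nearestCluster,
closestN) is hypothetical. It assumes squared Euclidean distance and that the
cluster centres from an earlier clustering run are already in memory.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Mahout.
final class ClusterLookup {

    // A point that remembers which object it came from (the role a named vector plays).
    static final class NamedPoint {
        final String name;
        final double[] values;

        NamedPoint(String name, double[] values) {
            this.name = name;
            this.values = values;
        }
    }

    // Squared Euclidean distance; assumes both arrays have the same length.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Answers 3 and 6: index of the existing centre closest to a new point,
    // or -1 if there are no centres. No re-clustering is involved.
    static int nearestCluster(double[][] centroids, double[] point) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centroids.length; i++) {
            double d = squaredDistance(centroids[i], point);
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return best;
    }

    // Answer 4: names of the (at most) n members of one cluster closest to the query,
    // nearest first. This is a brute-force scan over the cluster members.
    static List<String> closestN(List<NamedPoint> members, double[] query, int n) {
        List<NamedPoint> sorted = new ArrayList<>(members);
        sorted.sort(Comparator.comparingDouble((NamedPoint p) -> squaredDistance(p.values, query)));
        List<String> names = new ArrayList<>();
        for (int i = 0; i < Math.min(n, sorted.size()); i++) {
            names.add(sorted.get(i).name);
        }
        return names;
    }

    public static void main(String[] args) {
        // Two dimensions for brevity; the vectors in the question would have 8.
        double[][] centroids = { {0.0, 0.0}, {10.0, 10.0} };
        double[] p = {9.0, 8.0};
        System.out.println("cluster of P: " + nearestCluster(centroids, p));

        List<NamedPoint> members = new ArrayList<>();
        members.add(new NamedPoint("A", new double[] {10.0, 10.0}));
        members.add(new NamedPoint("B", new double[] {12.0, 12.0}));
        members.add(new NamedPoint("C", new double[] {9.0, 9.0}));
        System.out.println("closest to P: " + closestN(members, p, 2));
    }
}
```

Assigning a point this way does not move the cluster centres, so it is
consistent with answer 7: the clusters themselves only change when the
clustering is run again.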

HTH
