Posted to user@mahout.apache.org by Juan Francisco Contreras Gaitan <ju...@hotmail.com> on 2009/08/28 17:27:02 UTC

String clustering and other newbie questions

Hello,

I would like to do some clustering using Hadoop, and I found Mahout. I am really impressed, but as a newbie I got stuck and have several questions. The idea is to do string clustering: I have property values of some resources expressed as strings, and I would like to aggregate these resources. I use Eclipse as my IDE, and I have two working Mahout projects, one with the release version (0.1) and the other with the SVN version. I am able to compile the examples and run them on my own Hadoop cluster. I have focused on the Synthetic Control Data example using the Canopy algorithm because of its similarity to my problem.

- On the release version, with default parameter values, I get all the items in the same cluster (C1). Is this normal?
- On the SVN version I don't get readable output because there is no implemented OutputDriver. If I use the one from the release version, I get exceptions (I think the format has changed between releases, for example using the '{' symbol instead of '[').
- I use string values instead of double values. I have implemented my own string distance that returns a double when its parameters are strings, but I think that Mahout Vectors are implemented just to store double values. Is there any way to use string values?

I would be very grateful if anyone could help me.

Thank you very much in advance.

Regards,
jfcg

RE: String clustering and other newbie questions

Posted by Juan Francisco Contreras Gaitan <ju...@hotmail.com>.
Thanks a million for the information.

Cheers,
jfcg

> Date: Fri, 28 Aug 2009 09:37:57 -0700
> From: adil@yahoo-inc.com
> To: mahout-user@lucene.apache.org
> Subject: Re: String clustering and other newbie questions
> 
> Juan Francisco Contreras Gaitan wrote:
> > Hello,
> >
> > I would like to do some clustering by using Hadoop and I found Mahout. I am really impressed, but as a newbie I got stuck and I have several questions. The idea is to do string clustering: I have properties values expressed as strings of some resources, and I would like to aggregate these resources. I use Eclipse as IDE, and I have two Mahout working projects, one with release version (0.1) and the other one with SVN version. I am able to compile examples and to run them on my own Hadoop cluster. I have focused on Synthetic Control Data example using Canopy algorithm because of its similarity to my problem.
> >
> > - on release version with default parameter values I get all the items on the same cluster (C1), is it normal?
> There was an issue with hadoop 0.19 & above running combiners both on 
> the map side and the reduce side which causes this behavior in the 
> released code. Your best bet would be to  use the trunk version.
> 
> adil
> 

Re: String clustering and other newbie questions

Posted by Adil Aijaz <ad...@yahoo-inc.com>.
Juan Francisco Contreras Gaitan wrote:
> Hello,
>
> I would like to do some clustering by using Hadoop and I found Mahout. I am really impressed, but as a newbie I got stuck and I have several questions. The idea is to do string clustering: I have properties values expressed as strings of some resources, and I would like to aggregate these resources. I use Eclipse as IDE, and I have two Mahout working projects, one with release version (0.1) and the other one with SVN version. I am able to compile examples and to run them on my own Hadoop cluster. I have focused on Synthetic Control Data example using Canopy algorithm because of its similarity to my problem.
>
> - on release version with default parameter values I get all the items on the same cluster (C1), is it normal?
There was an issue with Hadoop 0.19 and above running combiners both on
the map side and the reduce side, which causes this behavior in the
released code. Your best bet would be to use the trunk version.

adil


RE: String clustering and other newbie questions

Posted by Juan Francisco Contreras Gaitan <ju...@hotmail.com>.
Sorry for the delay. One simplified example could be the following.

Values:

Rolling Stone; Organisation
The Rolling Stones; MusicGroups
Like a Rolling Stone; MusicSongs 
A Rolling Stone; MusicSongs 
Rolling Stone; Magazine

A sample distance metric could be the Levenshtein distance.

So, between the first item and the following ones, the distances would be 13, 12.2, 10.19, 9. The same would be computed for each of the other items.

The idea is that if we suppose 3 clusters, I expect to have item 1 in Cluster 1, items 2-3-4 in Cluster 2 and item 5 in Cluster 3.

I hope this clarifies things a little.
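
For reference, here is a minimal sketch of the plain Levenshtein distance applied to the name part of the sample values. It assumes unit cost for every insertion, deletion and substitution, and it ignores the type after the semicolon; both choices are assumptions for illustration. With unit costs the result is always an integer, so the figures quoted above presumably come from a differently weighted measure.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


# Name part of the first four sample values (the type label is left out).
names = ["Rolling Stone", "The Rolling Stones",
         "Like a Rolling Stone", "A Rolling Stone"]
distances = [levenshtein(names[0], other) for other in names[1:]]
print(distances)  # [5, 7, 2]
```

Because the first and fifth items share the same name, any measure that looks only at the name gives them a distance of 0; separating them would need the type label to enter the distance as well.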

I don't know the algorithm deeply, so I don't know whether the numerical values matter apart from the distance computation. If not, I think the idea of a mapping could be enough for our purposes. Could you give me some more information, or tell me where to start reading?

Thank you very much.

Regards,
jfcg

> Date: Fri, 28 Aug 2009 11:09:57 -0700
> From: jdog@windwardsolutions.com
> To: mahout-user@lucene.apache.org
> Subject: Re: String clustering and other newbie questions
> 
> Well, all of the clustering code is based upon clustering points in an 
> n-dimensional vector space and all of the APIs operate upon Vectors. We 
> do support the ability to attach a label binding Map to a Vector which 
> can map Strings into integer index values. Once this has been done you 
> can access the vector values symbolically. I'm not sure this will help 
> with your problem and you may need to write your own Canopy.
> 
> If you can post some examples of the values you wish to cluster and 
> something of your distance measure then I will see if I can figure out a 
> way to help you further.
> 
> Jeff
> 
> 
> 
> Juan Francisco Contreras Gaitan wrote:
> > Thank you so much for your quick reply.
> >
> > Unfortunately, I'm afraid that there is no way of massaging my strings into doubles, because the distance measure would have no sense in terms of doubles. Could you please give me some clue to write the required code in order to solve this difficulty?
> >
> > Thank you very much again.
> >
> > Regards,
> > jfcg
> >
> >   
> >> Date: Fri, 28 Aug 2009 08:49:38 -0700
> >> From: jdog@windwardsolutions.com
> >> To: mahout-user@lucene.apache.org
> >> Subject: Re: String clustering and other newbie questions
> >>
> >> Juan Francisco Contreras Gaitan wrote:
> >>     
> >>> Hello,
> >>>
> >>> I would like to do some clustering by using Hadoop and I found Mahout. I am really impressed, but as a newbie I got stuck and I have several questions. The idea is to do string clustering: I have properties values expressed as strings of some resources, and I would like to aggregate these resources. I use Eclipse as IDE, and I have two Mahout working projects, one with release version (0.1) and the other one with SVN version. I am able to compile examples and to run them on my own Hadoop cluster. I have focused on Synthetic Control Data example using Canopy algorithm because of its similarity to my problem.
> >>>
> >>> - on release version with default parameter values I get all the items on the same cluster (C1), is it normal?
> >>>   
> >>>       
> >> Are you running the Synthetic Control example data here? That example - 
> >> I just ran it on trunk - should produce 6 clusters in one file. It is 
> >> binary encoded though, and difficult to interpret in textual 
> >> representation. If you search for the string 'SparseVector' in the 
> >> canopies/part-0000 file you should see six instances.
> >>     
> >>> - on SVN version I don't have a readable output because there is no implemented OutputDriver. If I use the same as release version, I got exceptions (I think that format has changed between releases, for example using '{' symbol instead of '[')
> >>>   
> >>>       
> >> The output formats of all the clustering routines are now sequence files 
> >> which are binary encoded. The old OutputDriver won't handle it.
> >>     
> >>> - I use string values instead of double values. I have implemented my own string distance that returns a double when parameters are string, but I think that Mahout Vectors are implemented just to store double values. Is there any chance to use string values?
> >>>   
> >>>       
> >> Vectors are double only and you will need to massage your data into 
> >> numeric format to use out of the box clustering. Is there a way to 
> >> convert your property values into doubles?
> >>     
> >>> I would be very grateful if anyone could help me.
> >>>   
> >>>       
> >> I'm going to be working on converting clustering to Hadoop 0.20 in the 
> >> next weeks. Let's continue our dialog.
> >>     
> >>> Thank you very much in advance.
> >>>
> >>> Regards,
> >>> jfcg
> >>>
> 

Re: String clustering and other newbie questions

Posted by Isabel Drost <is...@apache.org>.
On Fri, 28 Aug 2009 11:15:22 -0700
Ted Dunning <te...@gmail.com> wrote:
> To cluster strings, you need to have a distance between "centroids"
> and strings. The DP clustering stuff could handle this, but not the
> rest of the clustering.

As an aside: It is possible to formulate k-means (probably canopy as
well?) on centroids. In the current implementation, at least the reduce
step would have to be modified.

Isabel

Re: String clustering and other newbie questions

Posted by Ted Dunning <te...@gmail.com>.
The k-means implementation has the idea of distance between vectors of real
numbers pretty deeply baked into it.  One example of this is that it assumes
that you can take the average (aka centroid) of a set of examples.  Taking
the average of a set of strings in the sense of Levenshtein distance would be
difficult.

There is an alternative algorithm called k-medoids which uses one of the
input samples as the centroid, but I would expect that this would give poor
results with Levenshtein distance.

It would, however, be very reasonable to use bigrams or trigrams as labels on
vector coordinates.  The vector value of a string would be derived by
weighting each bigram or trigram according to the negative log of the
prevalence of that bigram or trigram in your entire corpus.  This
representation would be highly amenable to k-means clustering.  Results
should be relatively good, although inspection of the centroids is likely to
be a bit confusing.
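
A rough sketch of that representation, in Python rather than Mahout's own API: each distinct character trigram labels one coordinate, and a string's value on that coordinate is the trigram count times the negative log of the trigram's prevalence. Here "prevalence" is taken to be the fraction of strings that contain the trigram, which is one possible reading; the padding, the lowercasing and the function names are likewise assumptions made for illustration.

```python
import math
from collections import Counter


def trigrams(text: str) -> list[str]:
    """Character trigrams of the lowercased text, padded with spaces at both ends."""
    padded = f"  {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def build_vectors(corpus: list[str]) -> tuple[list[str], list[list[float]]]:
    """Return the coordinate labels and one dense weighted vector per string."""
    doc_freq = Counter()
    for text in corpus:
        doc_freq.update(set(trigrams(text)))  # count each trigram once per string
    labels = sorted(doc_freq)
    index = {gram: i for i, gram in enumerate(labels)}
    n = len(corpus)
    vectors = []
    for text in corpus:
        vec = [0.0] * len(labels)
        for gram, count in Counter(trigrams(text)).items():
            # log(n / df) equals -log(prevalence); a trigram present in every
            # string therefore gets weight 0 and does not separate anything.
            vec[index[gram]] = count * math.log(n / doc_freq[gram])
        vectors.append(vec)
    return labels, vectors


corpus = ["Rolling Stone", "The Rolling Stones",
          "Like a Rolling Stone", "A Rolling Stone"]
labels, vectors = build_vectors(corpus)
print(len(labels), "coordinates for", len(vectors), "strings")
```

The resulting rows are ordinary real-valued vectors, so they can be averaged into centroids. A real corpus would call for sparse vectors rather than the dense lists used here.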

On Tue, Sep 1, 2009 at 5:06 AM, Juan Francisco Contreras Gaitan <
juanfcocontreras@hotmail.com> wrote:

> But if I understood you well, and as far as I know, Mahout has its own
> k-means implementation. Then, could I use it for my purposes instead of DP
> like setup?




-- 
Ted Dunning, CTO
DeepDyve

Re: String clustering and other newbie questions

Posted by Ted Dunning <te...@gmail.com>.
And there is a close correlation between n-gram matching scores and edit
distance scores.

On Tue, Sep 1, 2009 at 11:44 AM, Sean Owen <sr...@gmail.com> wrote:

> Yeah that probably kills the idea doesn't it... the 'best' centroid is well
> defined this way, but, searching for it may be completely unreasonable. I
> see why counts doesn't have this problem.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: String clustering and other newbie questions

Posted by Sean Owen <sr...@gmail.com>.
Yeah, that probably kills the idea, doesn't it... the 'best' centroid is well
defined this way, but searching for it may be completely unreasonable. I
see why counts don't have this problem.

On Sep 1, 2009 7:17 PM, "Ted Dunning" <te...@gmail.com> wrote:

On Tue, Sep 1, 2009 at 9:44 AM, Sean Owen <sr...@gmail.com> wrote:
> Centroids are just strings th...
Easy to say that.

Very hard to compute.  And the dimensionality is unbounded so the properties
of the centroid are not nice.  You wind up with centroids that are a large
number of edits away from everything and nearly the same distance from
everything.


> ...
>
> Anything else that doesn't map? Haven't thought about it a lot but don't
> yet see why k-means...
Depends on what you mean by well-behaved.  Mathematically speaking, string
edit measures are moderately well behaved.  Computationally and practically,
however, edit distances are not so nice.

Counts of common n-grams are much nicer since they can be interpreted as
vectors.



--
Ted Dunning, CTO
DeepDyve

Re: String clustering and other newbie questions

Posted by Ted Dunning <te...@gmail.com>.
On Tue, Sep 1, 2009 at 9:44 AM, Sean Owen <sr...@gmail.com> wrote:

> Centroids are just strings that are a similar number of edits away from
> another set of strings.
>

Easy to say that.

Very hard to compute.  And the dimensionality is unbounded so the properties
of the centroid are not nice.  You wind up with centroids that are a large
number of edits away from everything and nearly the same distance from
everything.


> ...
>
> Anything else that doesn't map? Haven't thought about it a lot but don't
> yet
> see why k-means couldn't let you cluster strings. In the CF code I do
> something similar for arbitrary 'items' so that hints to me that a well
> behaved distance metric is all you need?
>

Depends on what you mean by well-behaved.  Mathematically speaking, string
edit measures are moderately well behaved.  Computationally and practically,
however, edit distances are not so nice.

Counts of common n-grams are much nicer since they can be interpreted as
vectors.



-- 
Ted Dunning, CTO
DeepDyve

Re: String clustering and other newbie questions

Posted by Sean Owen <sr...@gmail.com>.
If I may attempt to clarify: I think it indeed makes no sense to have a
vector whose elements are 'string valued', nor can I think of any mapping to
doubles that has any use here.

What he is really after is clustering strings like they are vectors
themselves, not elements of another vector. The question is, how much do we
need to be able to think of strings like vectors to make the algorithm work?

We need a distance metric and he's suggesting Levenshtein, which seems OK at
first glance. (It satisfies the triangle inequality ... I think?)

Centroids are just strings that are a similar number of edits away from
another set of strings.

Distances are discrete, does that matter though?

Anything else that doesn't map? Haven't thought about it a lot but don't yet
see why k-means couldn't let you cluster strings. In the CF code I do
something similar for arbitrary 'items' so that hints to me that a well
behaved distance metric is all you need?

Of course, the code wouldn't quite work as-is to perform this. One would
need to probably modify it a lot.

For what it is worth... you could actually get the TreeClusteringRecommender
class to cluster your strings with just a little work. I am not sure if it
implements the algorithm you want. It is also not distributed.

Sean

On Sep 1, 2009 5:14 PM, "Ted Dunning" <te...@gmail.com> wrote:

That particular trick wouldn't work because you are losing the essence of
real numbers with this step.  If 1.0 refers to one string and 2.0 refers to
another, what does 1.5 refer to?

Better to use trigrams as the labels for the coordinates and weight them by
inverse document frequency.

On Tue, Sep 1, 2009 at 6:28 AM, Juan Francisco Contreras Gaitan <
juanfcocontreras@hotmail.com> wrote:

>
> ... I could use a Map between doubles and strings: storaging doubles in all
> the algorithm, and retrieving the strings to compute distance in measuring
> steps.
>

Re: String clustering and other newbie questions

Posted by Ted Dunning <te...@gmail.com>.
That particular trick wouldn't work because you are losing the essence of
real numbers with this step.  If 1.0 refers to one string and 2.0 refers to
another, what does 1.5 refer to?

Better to use trigrams as the labels for the coordinates and weight them by
inverse document frequency.
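
A tiny illustration of the problem, using made-up codes: once strings are replaced by arbitrary doubles, the averaging step of k-means produces a number that no string corresponds to.

```python
strings = ["Rolling Stone", "The Rolling Stones", "Like a Rolling Stone"]

# Arbitrary codes 1.0, 2.0, 3.0 (hypothetical; any other numbering has the same issue).
code_of = {s: float(i + 1) for i, s in enumerate(strings)}
string_of = {code: s for s, code in code_of.items()}

cluster = ["Rolling Stone", "The Rolling Stones"]
centroid_code = sum(code_of[s] for s in cluster) / len(cluster)

print(centroid_code)                 # 1.5
print(string_of.get(centroid_code))  # None: no string is mapped to 1.5
```

Renumbering the strings changes which codes sit next to each other, so any distance computed on the codes themselves reflects the numbering rather than the strings.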

On Tue, Sep 1, 2009 at 6:28 AM, Juan Francisco Contreras Gaitan <
juanfcocontreras@hotmail.com> wrote:

>
> ... I could use a Map between doubles and strings: storaging doubles in all
> the algorithm, and retrieving the strings to compute distance in measuring
> steps.
>

RE: String clustering and other newbie questions

Posted by Juan Francisco Contreras Gaitan <ju...@hotmail.com>.
Well, I have reread Ted's answer after having a look at some of the information Isabel gave me, and I think you are right. But I am not sure why Mahout's k-means algorithm cannot be used with strings once a string distance metric has been defined. Taking Jeff's advice, I could use a Map between doubles and strings: storing doubles throughout the algorithm, and retrieving the strings to compute the distance in the measuring steps. Would that make any sense?

Regards,
jfcg

> Subject: Re: String clustering and other newbie questions
> From: gsingers@apache.org
> Date: Tue, 1 Sep 2009 05:33:34 -0700
> To: mahout-user@lucene.apache.org
> 
> 
> On Sep 1, 2009, at 5:06 AM, Juan Francisco Contreras Gaitan wrote:
> 
> >
> > Ok, I see. Sorry for my unknowledge on these matters (I am going to  
> > read all the documentation you gave me closely).
> >
> > But if I understood you well, and as far as I know, Mahout has its  
> > own k-means implementation. Then, could I use it for my purposes  
> > instead of DP like setup?
> 
> I think Ted was saying that DP is the only one that would work for  
> what you described, but it's also possible we aren't understanding the  
> problem right either.
> 
> Obviously, one of the things we as a project need to develop more is  
> guidelines on which approaches work for which types of problems..
> 
> -Grant

Re: String clustering and other newbie questions

Posted by Grant Ingersoll <gs...@apache.org>.
On Sep 1, 2009, at 5:06 AM, Juan Francisco Contreras Gaitan wrote:

>
> Ok, I see. Sorry for my unknowledge on these matters (I am going to  
> read all the documentation you gave me closely).
>
> But if I understood you well, and as far as I know, Mahout has its  
> own k-means implementation. Then, could I use it for my purposes  
> instead of DP like setup?

I think Ted was saying that DP is the only one that would work for  
what you described, but it's also possible we aren't understanding the  
problem right either.

Obviously, one of the things we as a project need to develop more is  
guidelines on which approaches work for which types of problems.

-Grant

RE: String clustering and other newbie questions

Posted by Juan Francisco Contreras Gaitan <ju...@hotmail.com>.
Ok, I see. Sorry for my lack of knowledge on these matters (I am going to read all the documentation you gave me closely).

But if I understood you well, and as far as I know, Mahout has its own k-means implementation. Could I then use it for my purposes instead of a DP-like setup?

Thank you very much, Isabel.

Regards,
jfcg

> Date: Tue, 1 Sep 2009 08:23:05 +0200
> From: isabel@apache.org
> To: mahout-user@lucene.apache.org
> Subject: Re: String clustering and other newbie questions
> 
> On Mon, 31 Aug 2009 14:02:08 +0200
> Juan Francisco Contreras Gaitan <ju...@hotmail.com> wrote:
> 
> > Thank you very much for your answer, but I think I can't understand
> > it very well. Could you give me some more details?
> 
> Taking up that question, Ted, please correct me anywhere where I'm
> wrong.
> 
> 
> > For example, what does 'DP' stand for?
> 
> DP stands for Dirichlet Process, sometimes also referred to as "chinese
> restaurant process". There is a nice wikipedia page on dirichlet
> processes themselves: http://en.wikipedia.org/wiki/Dirichlet_process
> 
> An explanation of how they were employed to implement a clustering
> algorithm in Mahout is explained on one of our wiki pages (including
> references to the original papers):
> 
> http://cwiki.apache.org/MAHOUT/dirichlet-process-clustering.html
> 
> 
> > You can see an example of what I would like to
> > do in my previous answer.
> 
> In a k-Means like setup, you would implement your own distance
> (Levenshtein in your case) and use that to assign items to clusters
> during the E(stimation)-step. After that you would employ your own
> implementation of a centroid selection algorithm for recomputing
> cluster-centroids during the M(aximisation)-step.
> 
> In a DP like setup it would look a little different: During the E step
> instead of having k cluster centers, computing distances to these
> clusters and doing hard assignments you would have k cluster models
> and compute a probability of the strings being generated by each
> model. During the M step you would then recompute each cluster model
> based on how likely each string was found to be generated by that model.
> To arrive at a final assignment, after the assignment probabilities
> become stable you could choose to assign each point to the model with
> highest probability.
> 
>  
> Isabel

Re: String clustering and other newbie questions

Posted by Isabel Drost <is...@apache.org>.
On Mon, 31 Aug 2009 14:02:08 +0200
Juan Francisco Contreras Gaitan <ju...@hotmail.com> wrote:

> Thank you very much for your answer, but I think I can't understand
> it very well. Could you give me some more details?

Taking up that question, Ted, please correct me anywhere where I'm
wrong.


> For example, what does 'DP' stand for?

DP stands for Dirichlet Process, sometimes also referred to as "Chinese
restaurant process". There is a nice Wikipedia page on Dirichlet
processes themselves: http://en.wikipedia.org/wiki/Dirichlet_process

How they were employed to implement a clustering algorithm in Mahout
is explained on one of our wiki pages (including references to the
original papers):

http://cwiki.apache.org/MAHOUT/dirichlet-process-clustering.html


> You can see an example of what I would like to
> do in my previous answer.

In a k-means-like setup, you would implement your own distance
(Levenshtein in your case) and use that to assign items to clusters
during the E(stimation)-step. After that you would employ your own
implementation of a centroid selection algorithm for recomputing
cluster-centroids during the M(aximisation)-step.
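
A minimal single-machine sketch of that loop, not Mahout code. The centroid selection used here picks the cluster member with the smallest total distance to the other members (a medoid), which is just one possible choice; the function names and the toy data are made up for illustration. Whether such centroids give useful clusters under edit distance is a separate question.

```python
def levenshtein(a, b):
    """Unit-cost edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def assign(items, centers, distance):
    """E-step: hard-assign every item to its nearest center."""
    clusters = [[] for _ in centers]
    for item in items:
        nearest = min(range(len(centers)),
                      key=lambda k: distance(item, centers[k]))
        clusters[nearest].append(item)
    return clusters


def medoid(members, distance):
    """M-step: the member with the smallest summed distance to all members."""
    return min(members, key=lambda c: sum(distance(c, m) for m in members))


def cluster_strings(items, initial_centers, distance, max_iterations=20):
    centers = list(initial_centers)
    for _ in range(max_iterations):
        clusters = assign(items, centers, distance)
        # An empty cluster keeps its previous center.
        new_centers = [medoid(members, distance) if members else center
                       for members, center in zip(clusters, centers)]
        if new_centers == centers:
            break  # assignments can no longer change
        centers = new_centers
    return centers, assign(items, centers, distance)


items = ["apple", "apples", "appel", "banana", "bananas", "banan"]
centers, clusters = cluster_strings(items, ["apple", "banana"], levenshtein)
print(centers)
print(clusters)
```

The medoid step costs a quadratic number of distance computations per cluster, which is one practical reason this does not scale as easily as averaging vectors.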

In a DP-like setup it would look a little different: During the E step,
instead of having k cluster centers, computing distances to these
clusters and doing hard assignments, you would have k cluster models
and compute a probability of the strings being generated by each
model. During the M step you would then recompute each cluster model
based on how likely each string was found to be generated by that model.
To arrive at a final assignment, after the assignment probabilities
become stable you could choose to assign each point to the model with
highest probability.
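
To make the soft-assignment idea concrete, here is a simplified sketch. It is a plain EM loop over a fixed number of models, not a Dirichlet process and not Mahout's implementation. Each cluster model is assumed to be a smoothed character-bigram model; that model choice, the equal priors, the smoothing constant and all names are hypothetical.

```python
import math
from collections import defaultdict

ALPHABET_SIZE = 128  # assumption: smoothing denominator for an ASCII-like alphabet


def bigrams(text):
    padded = f"^{text.lower()}$"  # mark the start and end of the string
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


class BigramModel:
    """Cluster model: weighted character-bigram counts with add-one smoothing."""

    def __init__(self):
        self.pair_counts = defaultdict(float)
        self.first_counts = defaultdict(float)

    def add(self, text, weight):
        for gram in bigrams(text):
            self.pair_counts[gram] += weight
            self.first_counts[gram[0]] += weight

    def log_prob(self, text):
        # Sum of log P(next char | current char) over the whole string.
        return sum(math.log((self.pair_counts.get(g, 0.0) + 1.0)
                            / (self.first_counts.get(g[0], 0.0) + ALPHABET_SIZE))
                   for g in bigrams(text))


def e_step(items, models):
    """P(model | string) for every string, assuming equal model priors."""
    rows = []
    for text in items:
        logs = [m.log_prob(text) for m in models]
        top = max(logs)  # subtract the maximum for numerical stability
        weights = [math.exp(value - top) for value in logs]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


def m_step(items, rows, k):
    """Rebuild each model from all strings, weighted by their probabilities."""
    models = [BigramModel() for _ in range(k)]
    for text, row in zip(items, rows):
        for model, weight in zip(models, row):
            model.add(text, weight)
    return models


def fit(items, seeds, iterations=10):
    models = []
    for seed in seeds:  # start each model from a single seed string
        model = BigramModel()
        model.add(seed, 1.0)
        models.append(model)
    for _ in range(iterations):
        models = m_step(items, e_step(items, models), len(seeds))
    rows = e_step(items, models)
    # Final hard assignment: the most probable model for each string.
    return [max(range(len(row)), key=row.__getitem__) for row in rows]


items = ["apple", "apples", "banana", "bananas"]
labels = fit(items, seeds=["apple", "banana"])
print(labels)
```

A Dirichlet process version would also let the number of models grow with the data, which this fixed-k sketch leaves out.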

 
Isabel

RE: String clustering and other newbie questions

Posted by Juan Francisco Contreras Gaitan <ju...@hotmail.com>.
Hello Ted,

Thank you very much for your answer, but I don't think I understand it very well. Could you give me some more details? For example, what does 'DP' stand for? You can see an example of what I would like to do in my previous answer.

I'm sorry for all these questions, but I'm just starting out in this field.

Thank you very much for your time.

Regards,
jfcg

> From: ted.dunning@gmail.com
> Date: Fri, 28 Aug 2009 11:15:22 -0700
> Subject: Re: String clustering and other newbie questions
> To: mahout-user@lucene.apache.org
> 
> To cluster strings, you need to have a distance between "centroids" and
> strings.  The DP clustering stuff could handle this, but not the rest of the
> clustering.  The way that it would work in DP would be that there would be
> parametrized models that describe probabilities of generating strings
> instead of just being multi-dimensional points.  The similarity of a string
> to a model is interpreted as the probability of the string given the model.
> 
> On Fri, Aug 28, 2009 at 11:09 AM, Jeff Eastman
> <jd...@windwardsolutions.com>wrote:
> 
> > Well, all of the clustering code is based upon clustering points in an
> > n-dimensional vector space and all of the APIs operate upon Vectors. We do
> > support the ability to attach a label binding Map to a Vector which can map
> > Strings into integer index values. Once this has been done you can access
> > the vector values symbolically. I'm not sure this will help with your
> > problem and you may need to write your own Canopy.
> >
> > If you can post some examples of the values you wish to cluster and
> > something of your distance measure then I will see if I can figure out a way
> > to help you further.
> >
> > Jeff
> >
> >
> >
> > Juan Francisco Contreras Gaitan wrote:
> >
> >> Thank you so much for your quick reply.
> >>
> >> Unfortunately, I'm afraid that there is no way of massaging my strings
> >> into doubles, because the distance measure would have no sense in terms of
> >> doubles. Could you please give me some clue to write the required code in
> >> order to solve this difficulty?
> >>
> >> Thank you very much again.
> >>
> >> Regards,
> >> jfcg
> >>
> >>
> >>
> >>> Date: Fri, 28 Aug 2009 08:49:38 -0700
> >>> From: jdog@windwardsolutions.com
> >>> To: mahout-user@lucene.apache.org
> >>> Subject: Re: String clustering and other newbie questions
> >>>
> >>> Juan Francisco Contreras Gaitan wrote:
> >>>
> >>>
> >>>> Hello,
> >>>>
> >>>> I would like to do some clustering by using Hadoop and I found Mahout. I
> >>>> am really impressed, but as a newbie I got stuck and I have several
> >>>> questions. The idea is to do string clustering: I have properties values
> >>>> expressed as strings of some resources, and I would like to aggregate these
> >>>> resources. I use Eclipse as IDE, and I have two Mahout working projects, one
> >>>> with release version (0.1) and the other one with SVN version. I am able to
> >>>> compile examples and to run them on my own Hadoop cluster. I have focused on
> >>>> Synthetic Control Data example using Canopy algorithm because of its
> >>>> similarity to my problem.
> >>>>
> >>>> - on release version with default parameter values I get all the items
> >>>> on the same cluster (C1), is it normal?
> >>>>
> >>>>
> >>> Are you running the Synthetic Control example data here? That example - I
> >>> just ran it on trunk - should produce 6 clusters in one file. It is binary
> >>> encoded though, and difficult to interpret in textual representation. If you
> >>> search for the string 'SparseVector' in the canopies/part-0000 file you
> >>> should see six instances.
> >>>
> >>>
> >>>> - on SVN version I don't have a readable output because there is no
> >>>> implemented OutputDriver. If I use the same as release version, I got
> >>>> exceptions (I think that format has changed between releases, for example
> >>>> using '{' symbol instead of '[')
> >>>>
> >>>>
> >>> The output formats of all the clustering routines are now sequence files
> >>> which are binary encoded. The old OutputDriver won't handle it.
> >>>
> >>>
> >>>> - I use string values instead of double values. I have implemented my
> >>>> own string distance that returns a double when parameters are string, but I
> >>>> think that Mahout Vectors are implemented just to store double values. Is
> >>>> there any chance to use string values?
> >>>>
> >>>>
> >>> Vectors are double only and you will need to massage your data into
> >>> numeric format to use out of the box clustering. Is there a way to convert
> >>> your property values into doubles?
> >>>
> >>>
> >>>> I would be very grateful if anyone could help me.
> >>>>
> >>>>
> >>> I'm going to be working on converting clustering to Hadoop 0.20 in the
> >>> next weeks. Let's continue our dialog.
> >>>
> >>>
> >>>> Thank you very much in advance.
> >>>>
> >>>> Regards,
> >>>> jfcg
> >>>>
> 
> 
> -- 
> Ted Dunning, CTO
> DeepDyve

Re: String clustering and other newbie questions

Posted by Ted Dunning <te...@gmail.com>.
To cluster strings, you need to have a distance between "centroids" and
strings.  The DP clustering stuff could handle this, but not the rest of the
clustering.  The way that it would work in DP would be that there would be
parametrized models that describe probabilities of generating strings
instead of just being multi-dimensional points.  The similarity of a string
to a model is interpreted as the probability of the string given the model.

On Fri, Aug 28, 2009 at 11:09 AM, Jeff Eastman
<jd...@windwardsolutions.com>wrote:

> Well, all of the clustering code is based upon clustering points in an
> n-dimensional vector space and all of the APIs operate upon Vectors. We do
> support the ability to attach a label binding Map to a Vector which can map
> Strings into integer index values. Once this has been done you can access
> the vector values symbolically. I'm not sure this will help with your
> problem and you may need to write your own Canopy.
>
> If you can post some examples of the values you wish to cluster and
> something of your distance measure then I will see if I can figure out a way
> to help you further.
>
> Jeff
>
>
>
> Juan Francisco Contreras Gaitan wrote:
>
>> Thank you so much for your quick reply.
>>
>> Unfortunately, I'm afraid that there is no way of massaging my strings
>> into doubles, because the distance measure would have no sense in terms of
>> doubles. Could you please give me some clue to write the required code in
>> order to solve this difficulty?
>>
>> Thank you very much again.
>>
>> Regards,
>> jfcg
>>
>>
>>
>>> Date: Fri, 28 Aug 2009 08:49:38 -0700
>>> From: jdog@windwardsolutions.com
>>> To: mahout-user@lucene.apache.org
>>> Subject: Re: String clustering and other newbie questions
>>>
>>> Juan Francisco Contreras Gaitan wrote:
>>>
>>>
>>>> Hello,
>>>>
>>>> I would like to do some clustering by using Hadoop and I found Mahout. I
>>>> am really impressed, but as a newbie I got stuck and I have several
>>>> questions. The idea is to do string clustering: I have properties values
>>>> expressed as strings of some resources, and I would like to aggregate these
>>>> resources. I use Eclipse as IDE, and I have two Mahout working projects, one
>>>> with release version (0.1) and the other one with SVN version. I am able to
>>>> compile examples and to run them on my own Hadoop cluster. I have focused on
>>>> Synthetic Control Data example using Canopy algorithm because of its
>>>> similarity to my problem.
>>>>
>>>> - on release version with default parameter values I get all the items
>>>> on the same cluster (C1), is it normal?
>>>>
>>>>
>>> Are you running the Synthetic Control example data here? That example - I
>>> just ran it on trunk - should produce 6 clusters in one file. It is binary
>>> encoded though, and difficult to interpret in textual representation. If you
>>> search for the string 'SparseVector' in the canopies/part-0000 file you
>>> should see six instances.
>>>
>>>
>>>> - on SVN version I don't have a readable output because there is no
>>>> implemented OutputDriver. If I use the same as release version, I got
>>>> exceptions (I think that format has changed between releases, for example
>>>> using '{' symbol instead of '[')
>>>>
>>>>
>>> The output formats of all the clustering routines are now sequence files
>>> which are binary encoded. The old OutputDriver won't handle it.
>>>
>>>
>>>> - I use string values instead of double values. I have implemented my
>>>> own string distance that returns a double when parameters are string, but I
>>>> think that Mahout Vectors are implemented just to store double values. Is
>>>> there any chance to use string values?
>>>>
>>>>
>>> Vectors are double only and you will need to massage your data into
>>> numeric format to use out of the box clustering. Is there a way to convert
>>> your property values into doubles?
>>>
>>>
>>>> I would be very grateful if anyone could help me.
>>>>
>>>>
>>> I'm going to be working on converting clustering to Hadoop 0.20 in the
>>> next weeks. Let's continue our dialog.
>>>
>>>
>>>> Thank you very much in advance.
>>>>
>>>> Regards,
>>>> jfcg
>>>>
>>>>
>>>>
>>>
>>
>>
>
>


-- 
Ted Dunning, CTO
DeepDyve

Re: String clustering and other newbie questions

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Well, all of the clustering code is based upon clustering points in an
n-dimensional vector space, and all of the APIs operate on Vectors. We
do support attaching a label-binding Map to a Vector, which maps
Strings to integer index values. Once this has been done, you can
access the vector values symbolically. I'm not sure this will help with
your problem, though, and you may need to write your own Canopy.
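
As a rough illustration of what "writing your own Canopy" could look
like for strings, here is a minimal, self-contained Java sketch. It does
not use any Mahout API and runs in memory on a single machine. The class
name StringCanopy, the method names, and the use of Levenshtein edit
distance are all assumptions made for the example; the poster's own
string distance would replace distance(). The thresholds follow the
usual canopy convention of a loose threshold t1 and a tight threshold
t2, with t1 >= t2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

// Hypothetical, Mahout-independent sketch of canopy clustering over strings.
final class StringCanopy {

    // Stand-in string distance (Levenshtein edit distance); swap in your own measure.
    static double distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Builds canopies: t1 is the loose threshold, t2 the tight one (t1 >= t2).
    static List<List<String>> canopies(List<String> points, double t1, double t2) {
        if (t1 < t2) {
            throw new IllegalArgumentException("t1 must be >= t2");
        }
        List<String> remaining = new LinkedList<>(points);
        List<List<String>> result = new ArrayList<>();
        while (!remaining.isEmpty()) {
            String center = remaining.remove(0); // next unassigned point becomes a center
            List<String> canopy = new ArrayList<>();
            canopy.add(center);
            Iterator<String> it = remaining.iterator();
            while (it.hasNext()) {
                String p = it.next();
                double d = distance(center, p);
                if (d < t1) {
                    canopy.add(p); // within the loose threshold: member of this canopy
                }
                if (d < t2) {
                    it.remove();   // within the tight threshold: cannot start a new canopy
                }
            }
            result.add(canopy);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("hadoop", "hadop", "mahout", "mahoot", "canopy");
        for (List<String> canopy : canopies(values, 3.0, 2.0)) {
            System.out.println(canopy);
        }
    }
}
```

Because the algorithm only ever calls distance(), it never needs the
values to be doubles, which is the property the string case relies on.
Distributing this over Hadoop is a separate step that the sketch does
not attempt.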

If you can post some examples of the values you wish to cluster and
some details of your distance measure, then I will see if I can figure
out a way to help you further.

Jeff



Juan Francisco Contreras Gaitan wrote:
> Thank you so much for your quick reply.
>
> Unfortunately, I'm afraid that there is no way of massaging my strings into doubles, because the distance measure would have no sense in terms of doubles. Could you please give me some clue to write the required code in order to solve this difficulty?
>
> Thank you very much again.
>
> Regards,
> jfcg
>
>   
>> Date: Fri, 28 Aug 2009 08:49:38 -0700
>> From: jdog@windwardsolutions.com
>> To: mahout-user@lucene.apache.org
>> Subject: Re: String clustering and other newbie questions
>>
>> Juan Francisco Contreras Gaitan wrote:
>>     
>>> Hello,
>>>
>>> I would like to do some clustering by using Hadoop and I found Mahout. I am really impressed, but as a newbie I got stuck and I have several questions. The idea is to do string clustering: I have properties values expressed as strings of some resources, and I would like to aggregate these resources. I use Eclipse as IDE, and I have two Mahout working projects, one with release version (0.1) and the other one with SVN version. I am able to compile examples and to run them on my own Hadoop cluster. I have focused on Synthetic Control Data example using Canopy algorithm because of its similarity to my problem.
>>>
>>> - on release version with default parameter values I get all the items on the same cluster (C1), is it normal?
>>>   
>>>       
>> Are you running the Synthetic Control example data here? That example - 
>> I just ran it on trunk - should produce 6 clusters in one file. It is 
>> binary encoded though, and difficult to interpret in textual 
>> representation. If you search for the string 'SparseVector' in the 
>> canopies/part-0000 file you should see six instances.
>>     
>>> - on SVN version I don't have a readable output because there is no implemented OutputDriver. If I use the same as release version, I got exceptions (I think that format has changed between releases, for example using '{' symbol instead of '[')
>>>   
>>>       
>> The output formats of all the clustering routines are now sequence files 
>> which are binary encoded. The old OutputDriver won't handle it.
>>     
>>> - I use string values instead of double values. I have implemented my own string distance that returns a double when parameters are string, but I think that Mahout Vectors are implemented just to store double values. Is there any chance to use string values?
>>>   
>>>       
>> Vectors are double only and you will need to massage your data into 
>> numeric format to use out of the box clustering. Is there a way to 
>> convert your property values into doubles?
>>     
>>> I would be very grateful if anyone could help me.
>>>   
>>>       
>> I'm going to be working on converting clustering to Hadoop 0.20 in the 
>> next weeks. Let's continue our dialog.
>>     
>>> Thank you very much in advance.
>>>
>>> Regards,
>>> jfcg
>>>
>>>   
>>>       
>
>   


RE: String clustering and other newbie questions

Posted by Juan Francisco Contreras Gaitan <ju...@hotmail.com>.
Thank you so much for your quick reply.

Unfortunately, I'm afraid there is no way of massaging my strings into doubles, because the distance measure would make no sense in terms of doubles. Could you please give me some pointers on writing the code required to get around this difficulty?

Thank you very much again.

Regards,
jfcg

> Date: Fri, 28 Aug 2009 08:49:38 -0700
> From: jdog@windwardsolutions.com
> To: mahout-user@lucene.apache.org
> Subject: Re: String clustering and other newbie questions
> 
> Juan Francisco Contreras Gaitan wrote:
> > Hello,
> >
> > I would like to do some clustering by using Hadoop and I found Mahout. I am really impressed, but as a newbie I got stuck and I have several questions. The idea is to do string clustering: I have properties values expressed as strings of some resources, and I would like to aggregate these resources. I use Eclipse as IDE, and I have two Mahout working projects, one with release version (0.1) and the other one with SVN version. I am able to compile examples and to run them on my own Hadoop cluster. I have focused on Synthetic Control Data example using Canopy algorithm because of its similarity to my problem.
> >
> > - on release version with default parameter values I get all the items on the same cluster (C1), is it normal?
> >   
> Are you running the Synthetic Control example data here? That example - 
> I just ran it on trunk - should produce 6 clusters in one file. It is 
> binary encoded though, and difficult to interpret in textual 
> representation. If you search for the string 'SparseVector' in the 
> canopies/part-0000 file you should see six instances.
> > - on SVN version I don't have a readable output because there is no implemented OutputDriver. If I use the same as release version, I got exceptions (I think that format has changed between releases, for example using '{' symbol instead of '[')
> >   
> The output formats of all the clustering routines are now sequence files 
> which are binary encoded. The old OutputDriver won't handle it.
> > - I use string values instead of double values. I have implemented my own string distance that returns a double when parameters are string, but I think that Mahout Vectors are implemented just to store double values. Is there any chance to use string values?
> >   
> Vectors are double only and you will need to massage your data into 
> numeric format to use out of the box clustering. Is there a way to 
> convert your property values into doubles?
> > I would be very grateful if anyone could help me.
> >   
> I'm going to be working on converting clustering to Hadoop 0.20 in the 
> next weeks. Let's continue our dialog.
> > Thank you very much in advance.
> >
> > Regards,
> > jfcg
> >
> >   
> 


Re: String clustering and other newbie questions

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Juan Francisco Contreras Gaitan wrote:
> Hello,
>
> I would like to do some clustering by using Hadoop and I found Mahout. I am really impressed, but as a newbie I got stuck and I have several questions. The idea is to do string clustering: I have properties values expressed as strings of some resources, and I would like to aggregate these resources. I use Eclipse as IDE, and I have two Mahout working projects, one with release version (0.1) and the other one with SVN version. I am able to compile examples and to run them on my own Hadoop cluster. I have focused on Synthetic Control Data example using Canopy algorithm because of its similarity to my problem.
>
> - on release version with default parameter values I get all the items on the same cluster (C1), is it normal?
>   
Are you running the Synthetic Control example data here? That example -
I just ran it on trunk - should produce 6 clusters in one file. The file
is binary encoded, though, and difficult to read as text. If you search
for the string 'SparseVector' in the canopies/part-0000 file you should
see six instances.
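
One way to run that search without opening the binary file in an editor
is a small byte-level scan. The sketch below is plain Java with no
Hadoop or Mahout dependency; the class name PatternCounter is made up
for the example, and it assumes the file has been copied to the local
file system and that the class name is stored uncompressed as ASCII
text. Depending on the file layout, the count may also include
occurrences of the name outside the cluster records.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: counts occurrences of an ASCII string in a binary file.
final class PatternCounter {

    // Naive scan: counts (possibly overlapping) positions where pattern occurs in data.
    static int count(byte[] data, byte[] pattern) {
        if (pattern.length == 0) {
            return 0;
        }
        int hits = 0;
        for (int i = 0; i + pattern.length <= data.length; i++) {
            int j = 0;
            while (j < pattern.length && data[i + j] == pattern[j]) {
                j++;
            }
            if (j == pattern.length) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Default path taken from the message above; pass another path as the first argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "canopies/part-0000");
        byte[] data = Files.readAllBytes(file); // reads the whole file into memory
        byte[] pattern = "SparseVector".getBytes(StandardCharsets.US_ASCII);
        System.out.println(count(data, pattern));
    }
}
```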
> - on SVN version I don't have a readable output because there is no implemented OutputDriver. If I use the same as release version, I got exceptions (I think that format has changed between releases, for example using '{' symbol instead of '[')
>   
The output formats of all the clustering routines are now sequence files,
which are binary encoded. The old OutputDriver won't handle them.
> - I use string values instead of double values. I have implemented my own string distance that returns a double when parameters are string, but I think that Mahout Vectors are implemented just to store double values. Is there any chance to use string values?
>   
Vectors are double only, and you will need to massage your data into
numeric format to use the out-of-the-box clustering. Is there a way to
convert your property values into doubles?
> I would be very grateful if anyone could help me.
>   
I'm going to be working on converting clustering to Hadoop 0.20 in the
next few weeks. Let's continue our dialog.
> Thank you very much in advance.
>
> Regards,
> jfcg
>
>