Posted to user@mahout.apache.org by mercutio7979 <jb...@googlemail.com> on 2013/09/30 23:14:00 UTC

What are the best settings for my clustering task

Hello all,

I am currently trying to create clusters from a group of 50.000 strings that
contain product descriptions (around 70-100 characters each).

That group of 50.000 consists of roughly 5.000 individual products and ten
varying product descriptions per product. The product descriptions are
already prepared for clustering and contain a normalized brand name, product
model number, etc.

What would be a good approach to maximise the amount of found clusters? (The
best possible value would be 5.000 clusters with 10 products each.)

I adapted the reuters cluster script to read in my data and managed to
create a first set of clusters. However, I have not managed to maximise the
cluster count. 

The question is: what do I need to tweak with regard to the available mahout
settings, so the clusters are created as precisely as possible?

Many regards!
Jens





--
View this message in context: http://lucene.472066.n3.nabble.com/What-are-the-best-settings-for-my-clustering-task-tp4092807.html
Sent from the Mahout User List mailing list archive at Nabble.com.

Re: What are the best settings for my clustering task

Posted by Jens Bonerz <jb...@googlemail.com>.
I'd love to contribute to the community.
However, I am not proficient in Java and unfortunately have no dev
environment for that.
If it were written in Python, PHP or C++, that would be something else.

But: if you can steer me in the right direction, I will try to suggest a
patch.



2013/10/6 Ted Dunning <te...@gmail.com>

> It is there, at the very least as part of the streaming k-means code.  The
> abbreviation bkm has been used in the past.
>
> In looking at the code just now I don't find any command line invocation
> of bkm. It should be quite simple to write one and it would be very handy
> to have a way to run streaming k-means without a map reduce step as well.
> As such it might be good to have a new option to streaming k-means to use
> just bkm in a single thread, to use threaded streaming k-means on a single
> machine or to use map-reduce streaming k-means.
>
> You up for trying to make a patch?
>
> Sent from my iPhone
>
> On Oct 6, 2013, at 12:37, Jens Bonerz <jb...@googlemail.com> wrote:
>
> > Hmmm.. has ballkmeans made it already into the 0.8 release? can't find it
> > in the list of available programs when calling the mahout binary...
> >
> >
> > 2013/10/3 Ted Dunning <te...@gmail.com>
> >
> >> What you are seeing here are the cluster centroids themselves, not the
> >> cluster assignments.
> >>
> >> Streaming k-means is a single pass algorithm to derive these centroids.
> >> Typically, the next step is to cluster these centroids using ball
> k-means.
> >> *Those* results can then be applied back to the original (or new) input
> >> vectors to get cluster assignments for individual input vectors.
> >>
> >> I don't have command line specifics handy, but you seem to have done
> very
> >> well already at figuring out the details.
> >>
> >>
> >> On Oct 3, 2013, at 7:30 AM, Jens Bonerz wrote:
> >>
> >>> I created a series of scripts to try out streamingkmeans in mahout and
> >>> increased the number of clusters to a high amount as suggested by Ted.
> >>> Everything seems to work. However, I can't figure out how to access the
> >>> actual cluster data at the end of the process.
> >>>
> >>> It just gives me output that I cannot really understand... I would
> expect
> >>> my product_ids being referenced to cluster ids...
> >>>
> >>> Example of the procedure's output:
> >>>
> >>> hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running
> >>> locally
> >>> Input Path: file:MahoutCluster/part-r-00000
> >>> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
> >>> org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable
> >>> Key: 0: Value: key = 8678, weight = 3.00, vector =
> >>>
> >>
> {37:26.83479118347168,6085:8.162049293518066,4785:10.44443130493164,2493:19.677349090576172,2494:16.06648826599121,9659:9.568963050842285,20877:9.307...
> >>> Key: 1: Value: key = 3118, weight = 14.00, vector =
> >>>
> >>
> {19457:5.646900812784831,8774:4.746263821919759,9738:1.022495985031128,13301:5.762300491333008,14947:0.6774413585662842,8787:6.841406504313151,14958...
> >>> Key: 2: Value: key = 2867, weight = 3.00, vector =
> >>>
> >>
> {15873:10.955257415771484,1615:4.029662132263184,20963:4.979445934295654,3978:5.611329555511475,7950:8.364990234375,8018:8.68657398223877,15433:7.959...
> >>> Key: 3: Value: key = 6295, weight = 1.00, vector =
> >>>
> >>
> {17113:10.955257415771484,15347:9.568963050842285,15348:10.955257415771484,19845:7.805374622344971,7945:10.262109756469727,15356:18.090286254882812,1...
> >>> Key: 4: Value: key = 6725, weight = 4.00, vector =
> >>>
> >>
> {10570:7.64715051651001,14915:6.126943588256836,14947:4.064648151397705,14330:9.414812088012695,18271:2.7172491550445557,14335:19.677349090576172,143...
> >>> Key: 5:......
> >>>
> >>>
> >>>
> >>> this is my recipe:
> >>>
> >>>
> >>
> --------------------------------------------------------------------------------------------------------
> >>> Step 1
> >>> Create a seqfile from my data with Python. It's the product_id (key) and
> >> the
> >>> short normalized description (value) that is written into the sequence
> >> file.
> >>>
> >>>
> >>>
> >>>
> >>
> --------------------------------------------------------------------------------------------------------
> >>> Step 2
> >>> create vectors from that data with the following command:
> >>>
> >>> mahout seq2sparse \
> >>>  -i productClusterSequenceData/productClusterSequenceData.seq \
> >>>  -o productClusterSequenceData/vectors \
> >>>
> >>>
> >>>
> >>>
> >>
> --------------------------------------------------------------------------------------------------------
> >>> Step 3
> >>> Cluster the vectors using streamingkmeans with this command:
> >>>
> >>> mahout streamingkmeans \
> >>> -i productClusterSequenceData/vectors/tfidf-vectors \
> >>> -o MahoutCluster \
> >>> --tempDir /tmp \
> >>> -ow -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
> >>> -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
> >>> -k 10000 -km 500000 \
> >>>
> >>>
> >>>
> >>>
> >>
> --------------------------------------------------------------------------------------------------------
> >>> Step 4
> >>> Export the streamingkmeans cluster data into a textfile (for an extract
> >> of
> >>> the result see above)
> >>>
> >>> mahout seqdumper \
> >>> -i MahoutCluster > similarProducts.txt
> >>>
> >>> What am I missing?
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> 2013/10/3 Ted Dunning <te...@gmail.com>
> >>>
> >>>> Yes.  That will work.
> >>>>
> >>>> The sketch will then contain 10,000 x log N centroids.  If N = 10^9,
> >> log N
> >>>> \approx 30 so the sketch will have about 300,000 weighted centroids
> >> in
> >>>> it.  The final clustering will have to process these centroids to
> >> produce
> >>>> the desired 5,000 clusters.  Since 300,000 is a relatively small
> number
> >> of
> >>>> data points, this clustering step should proceed relatively quickly.
> >>>>
> >>>>
> >>>>
> >>>> On Wed, Oct 2, 2013 at 10:21 AM, Jens Bonerz <jb...@googlemail.com>
> >>>> wrote:
> >>>>
> >>>>> thx for your elaborate answer.
> >>>>>
> >>>>> so if the upper bound on the final number of clusters is unknown in
> the
> >>>>> beginning, what would happen, if I define a very high number that is
> >>>>> guaranteed to be > the estimated number of clusters.
> >>>>> for example if I set it to 10.000 clusters if an estimate of 5.000 is
> >>>>> likely, will that work?
> >>>>>
> >>>>>
> >>>>> 2013/10/2 Ted Dunning <te...@gmail.com>
> >>>>>
> >>>>>> The way that the new streaming k-means works is that there is a
> first
> >>>>>> sketch pass which only requires an upper bound on the final number
> of
> >>>>>> clusters you will want.  It adaptively creates more or less clusters
> >>>>>> depending on the data and your bound.  This sketch is guaranteed to
> be
> >>>>>> computed within at most one map-reduce pass.  There is a threaded
> >>>> version
> >>>>>> that runs (fast) on a single machine.  The threaded version is
> liable
> >>>> to
> >>>>> be
> >>>>>> faster than the map-reduce version for moderate or smaller data
> sizes.
> >>>>>>
> >>>>>> That sketch can then be used to do all kinds of things that rely on
> >>>>>> Euclidean distance and still get results within a small factor of
> the
> >>>>> same
> >>>>>> algorithm applied to all of the data.  Typically this second phase
> is
> >> a
> >>>>>> ball k-means algorithm, but it could easily be a dp-means algorithm
> >> [1]
> >>>>> if
> >>>>>> you want a variable number of clusters.  Indeed, you could run many
> >>>>>> dp-means passes with different values of lambda on the same sketch.
> >>>> Note
> >>>>>> that the sketch is small enough that in-memory clustering is
> entirely
> >>>>>> viable and is very fast.
> >>>>>>
> >>>>>> For the problem you describe, however, you probably don't need the
> >>>> sketch
> >>>>>> approach at all and can probably apply ball k-means or dp-means
> >>>> directly.
> >>>>>> Running many k-means clusterings with differing values of k should
> be
> >>>>>> entirely feasible as well with such data sizes.
> >>>>>>
> >>>>>> [1]
> http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Oct 2, 2013 at 9:11 AM, Jens Bonerz <jbonerz@googlemail.com
> >
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Isn't the streaming k-means just a different approach to crunch
> >>>> through
> >>>>>> the
> >>>>>>> data? In other words, the result of streaming k-means should be
> >>>>>> comparable
> >>>>>>> to using k-means in multiple chained map reduce cycles?
> >>>>>>>
> >>>>>>> I just read a paper about the k-means clustering and its underlying
> >>>>>>> algorithm.
> >>>>>>>
> >>>>>>> According to that paper, k-means relies on a preknown/predefined
> >>>> amount
> >>>>>> of
> >>>>>>> clusters as an input parameter.
> >>>>>>>
> >>>>>>> Link: http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf
> >>>>>>>
> >>>>>>> In my current scenario however, the number of clusters is unknown
> at
> >>>>> the
> >>>>>>> beginning.
> >>>>>>>
> >>>>>>> Maybe k-means is just not the right algorithm for clustering
> similar
> >>>>>>> products based on their short description text? What else could I
> >>>> use?
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> 2013/10/1 Ted Dunning <te...@gmail.com>
> >>>>>>>
> >>>>>>>> At such small sizes, I would guess that the sequential version of
> >>>> the
> >>>>>>>> streaming k-means or ball k-means would be better options.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Mon, Sep 30, 2013 at 2:14 PM, mercutio7979 <
> >>>>> jbonerz@googlemail.com
> >>>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hello all,
> >>>>>>>>>
> >>>>>>>>> I am currently trying to create clusters from a group of 50.000
> >>>>> strings
> >>>>>>> that
> >>>>>>>>> contain product descriptions (around 70-100 characters length
> >>>>> each).
> >>>>>>>>>
> >>>>>>>>> That group of 50.000 consists of roughly 5.000 individual
> >>>> products
> >>>>>> and
> >>>>>>>> ten
> >>>>>>>>> varying product descriptions per product. The product
> >>>> descriptions
> >>>>>> are
> >>>>>>>>> already prepared for clustering and contain a normalized brand
> >>>>> name,
> >>>>>>>>> product
> >>>>>>>>> model number, etc.
> >>>>>>>>>
> >>>>>>>>> What would be a good approach to maximise the amount of found
> >>>>>> clusters
> >>>>>>>> (the
> >>>>>>>>> best possible value would be 5.000 clusters with 10 products
> >>>> each)
> >>>>>>>>>
> >>>>>>>>> I adapted the reuters cluster script to read in my data and
> >>>> managed
> >>>>>> to
> >>>>>>>>> create a first set of clusters. However, I have not managed to
> >>>>>> maximise
> >>>>>>>> the
> >>>>>>>>> cluster count.
> >>>>>>>>>
> >>>>>>>>> The question is: what do I need to tweak with regard to the
> >>>>> available
> >>>>>>>>> mahout
> >>>>>>>>> settings, so the clusters are created as precisely as possible?
> >>>>>>>>>
> >>>>>>>>> Many regards!
> >>>>>>>>> Jens
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> View this message in context:
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>
> http://lucene.472066.n3.nabble.com/What-are-the-best-settings-for-my-clustering-task-tp4092807.html
> >>>>>>>>> Sent from the Mahout User List mailing list archive at
> >>>> Nabble.com.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>>
>



Re: What are the best settings for my clustering task

Posted by Ted Dunning <te...@gmail.com>.
It is there, at the very least as part of the streaming k-means code.  The abbreviation bkm has been used in the past.  

In looking at the code just now I don't find any command line invocation of bkm. It should be quite simple to write one, and it would be very handy to have a way to run streaming k-means without a map-reduce step as well. As such, it might be good to have a new option for streaming k-means: use just bkm in a single thread, use threaded streaming k-means on a single machine, or use map-reduce streaming k-means.

You up for trying to make a patch?

Sent from my iPhone
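For anyone taking up that patch idea, the heart of the sketch pass is small enough to outline in plain Python. This is only an illustration of the adaptive one-pass idea (a simplified sketch with made-up names), not Mahout's actual StreamingKMeans implementation:

```python
import math
import random

def streaming_sketch(points, distance_cutoff, seed=42):
    """One-pass 'sketch' clustering in the spirit of streaming k-means:
    a point far from every existing centroid is likely to start a new
    weighted centroid; otherwise it folds into the nearest centroid as
    a running weighted mean. The result is a small weighted summary
    that a batch algorithm (e.g. ball k-means) can cluster in memory."""
    rng = random.Random(seed)
    centroids = []  # list of [vector, weight] pairs
    for p in points:
        if not centroids:
            centroids.append([list(p), 1.0])
            continue
        # nearest existing centroid and its distance to p
        best = min(range(len(centroids)),
                   key=lambda i: math.dist(centroids[i][0], p))
        d = math.dist(centroids[best][0], p)
        # spawn a new centroid with probability proportional to distance
        if rng.random() < min(1.0, d / distance_cutoff):
            centroids.append([list(p), 1.0])
        else:
            c, w = centroids[best]
            centroids[best] = [[(ci * w + pi) / (w + 1.0)
                                for ci, pi in zip(c, p)], w + 1.0]
    return centroids
```

The weighted centroids it returns are what the second-stage batch clustering (ball k-means or dp-means) would then reduce to the final clusters.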

On Oct 6, 2013, at 12:37, Jens Bonerz <jb...@googlemail.com> wrote:

> Hmmm.. has ballkmeans made it already into the 0.8 release? can't find it
> in the list of available programs when calling the mahout binary...
> 
> 
> 2013/10/3 Ted Dunning <te...@gmail.com>
> 
>> What you are seeing here are the cluster centroids themselves, not the
>> cluster assignments.
>> 
>> Streaming k-means is a single pass algorithm to derive these centroids.
>> Typically, the next step is to cluster these centroids using ball k-means.
>> *Those* results can then be applied back to the original (or new) input
>> vectors to get cluster assignments for individual input vectors.
>> 
>> I don't have command line specifics handy, but you seem to have done very
>> well already at figuring out the details.
>> 
>> 
>> On Oct 3, 2013, at 7:30 AM, Jens Bonerz wrote:
>> 
>>> I created a series of scripts to try out streamingkmeans in mahout and
>>> increased the number of clusters to a high amount as suggested by Ted.
>>> Everything seems to work. However, I can't figure out how to access the
>>> actual cluster data at the end of the process.
>>> 
>>> It just gives me output that I cannot really understand... I would expect
>>> my product_ids being referenced to cluster ids...
>>> 
>>> Example of the procedure's output:
>>> 
>>> hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running
>>> locally
>>> Input Path: file:MahoutCluster/part-r-00000
>>> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
>>> org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable
>>> Key: 0: Value: key = 8678, weight = 3.00, vector =
>>> 
>> {37:26.83479118347168,6085:8.162049293518066,4785:10.44443130493164,2493:19.677349090576172,2494:16.06648826599121,9659:9.568963050842285,20877:9.307...
>>> Key: 1: Value: key = 3118, weight = 14.00, vector =
>>> 
>> {19457:5.646900812784831,8774:4.746263821919759,9738:1.022495985031128,13301:5.762300491333008,14947:0.6774413585662842,8787:6.841406504313151,14958...
>>> Key: 2: Value: key = 2867, weight = 3.00, vector =
>>> 
>> {15873:10.955257415771484,1615:4.029662132263184,20963:4.979445934295654,3978:5.611329555511475,7950:8.364990234375,8018:8.68657398223877,15433:7.959...
>>> Key: 3: Value: key = 6295, weight = 1.00, vector =
>>> 
>> {17113:10.955257415771484,15347:9.568963050842285,15348:10.955257415771484,19845:7.805374622344971,7945:10.262109756469727,15356:18.090286254882812,1...
>>> Key: 4: Value: key = 6725, weight = 4.00, vector =
>>> 
>> {10570:7.64715051651001,14915:6.126943588256836,14947:4.064648151397705,14330:9.414812088012695,18271:2.7172491550445557,14335:19.677349090576172,143...
>>> Key: 5:......
>>> 
>>> 
>>> 
>>> this is my recipe:
>>> 
>>> 
>> --------------------------------------------------------------------------------------------------------
>>> Step 1
>>> Create a seqfile from my data with Python. It's the product_id (key) and
>> the
>>> short normalized description (value) that is written into the sequence
>> file.
>>> 
>>> 
>>> 
>>> 
>> --------------------------------------------------------------------------------------------------------
>>> Step 2
>>> create vectors from that data with the following command:
>>> 
>>> mahout seq2sparse \
>>>  -i productClusterSequenceData/productClusterSequenceData.seq \
>>>  -o productClusterSequenceData/vectors \
>>> 
>>> 
>>> 
>>> 
>> --------------------------------------------------------------------------------------------------------
>>> Step 3
>>> Cluster the vectors using streamingkmeans with this command:
>>> 
>>> mahout streamingkmeans \
>>> -i productClusterSequenceData/vectors/tfidf-vectors \
>>> -o MahoutCluster \
>>> --tempDir /tmp \
>>> -ow -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
>>> -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
>>> -k 10000 -km 500000 \
>>> 
>>> 
>>> 
>>> 
>> --------------------------------------------------------------------------------------------------------
>>> Step 4
>>> Export the streamingkmeans cluster data into a textfile (for an extract
>> of
>>> the result see above)
>>> 
>>> mahout seqdumper \
>>> -i MahoutCluster > similarProducts.txt
>>> 
>>> What am I missing?
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 2013/10/3 Ted Dunning <te...@gmail.com>
>>> 
>>>> Yes.  That will work.
>>>> 
>>>> The sketch will then contain 10,000 x log N centroids.  If N = 10^9,
>> log N
>>>> \approx 30 so the sketch will have about 300,000 weighted centroids
>> in
>>>> it.  The final clustering will have to process these centroids to
>> produce
>>>> the desired 5,000 clusters.  Since 300,000 is a relatively small number
>> of
>>>> data points, this clustering step should proceed relatively quickly.
>>>> 
>>>> 
>>>> 
>>>> On Wed, Oct 2, 2013 at 10:21 AM, Jens Bonerz <jb...@googlemail.com>
>>>> wrote:
>>>> 
>>>>> thx for your elaborate answer.
>>>>> 
>>>>> so if the upper bound on the final number of clusters is unknown in the
>>>>> beginning, what would happen, if I define a very high number that is
>>>>> guaranteed to be > the estimated number of clusters.
>>>>> for example if I set it to 10.000 clusters if an estimate of 5.000 is
>>>>> likely, will that work?
>>>>> 
>>>>> 
>>>>> 2013/10/2 Ted Dunning <te...@gmail.com>
>>>>> 
>>>>>> The way that the new streaming k-means works is that there is a first
>>>>>> sketch pass which only requires an upper bound on the final number of
>>>>>> clusters you will want.  It adaptively creates more or less clusters
>>>>>> depending on the data and your bound.  This sketch is guaranteed to be
>>>>>> computed within at most one map-reduce pass.  There is a threaded
>>>> version
>>>>>> that runs (fast) on a single machine.  The threaded version is liable
>>>> to
>>>>> be
>>>>>> faster than the map-reduce version for moderate or smaller data sizes.
>>>>>> 
>>>>>> That sketch can then be used to do all kinds of things that rely on
>>>>>> Euclidean distance and still get results within a small factor of the
>>>>> same
>>>>>> algorithm applied to all of the data.  Typically this second phase is
>> a
>>>>>> ball k-means algorithm, but it could easily be a dp-means algorithm
>> [1]
>>>>> if
>>>>>> you want a variable number of clusters.  Indeed, you could run many
>>>>>> dp-means passes with different values of lambda on the same sketch.
>>>> Note
>>>>>> that the sketch is small enough that in-memory clustering is entirely
>>>>>> viable and is very fast.
>>>>>> 
>>>>>> For the problem you describe, however, you probably don't need the
>>>> sketch
>>>>>> approach at all and can probably apply ball k-means or dp-means
>>>> directly.
>>>>>> Running many k-means clusterings with differing values of k should be
>>>>>> entirely feasible as well with such data sizes.
>>>>>> 
>>>>>> [1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Wed, Oct 2, 2013 at 9:11 AM, Jens Bonerz <jb...@googlemail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> Isn't the streaming k-means just a different approach to crunch
>>>> through
>>>>>> the
>>>>>>> data? In other words, the result of streaming k-means should be
>>>>>> comparable
>>>>>>> to using k-means in multiple chained map reduce cycles?
>>>>>>> 
>>>>>>> I just read a paper about the k-means clustering and its underlying
>>>>>>> algorithm.
>>>>>>> 
>>>>>>> According to that paper, k-means relies on a preknown/predefined
>>>> amount
>>>>>> of
>>>>>>> clusters as an input parameter.
>>>>>>> 
>>>>>>> Link: http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf
>>>>>>> 
>>>>>>> In my current scenario however, the number of clusters is unknown at
>>>>> the
>>>>>>> beginning.
>>>>>>> 
>>>>>>> Maybe k-means is just not the right algorithm for clustering similar
>>>>>>> products based on their short description text? What else could I
>>>> use?
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 2013/10/1 Ted Dunning <te...@gmail.com>
>>>>>>> 
>>>>>>>> At such small sizes, I would guess that the sequential version of
>>>> the
>>>>>>>> streaming k-means or ball k-means would be better options.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Mon, Sep 30, 2013 at 2:14 PM, mercutio7979 <
>>>>> jbonerz@googlemail.com
>>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Hello all,
>>>>>>>>> 
>>>>>>>>> I am currently trying to create clusters from a group of 50.000
>>>>> strings
>>>>>>> that
>>>>>>>>> contain product descriptions (around 70-100 characters length
>>>>> each).
>>>>>>>>> 
>>>>>>>>> That group of 50.000 consists of roughly 5.000 individual
>>>> products
>>>>>> and
>>>>>>>> ten
>>>>>>>>> varying product descriptions per product. The product
>>>> descriptions
>>>>>> are
>>>>>>>>> already prepared for clustering and contain a normalized brand
>>>>> name,
>>>>>>>>> product
>>>>>>>>> model number, etc.
>>>>>>>>> 
>>>>>>>>> What would be a good approach to maximise the amount of found
>>>>>> clusters
>>>>>>>> (the
>>>>>>>>> best possible value would be 5.000 clusters with 10 products
>>>> each)
>>>>>>>>> 
>>>>>>>>> I adapted the reuters cluster script to read in my data and
>>>> managed
>>>>>> to
>>>>>>>>> create a first set of clusters. However, I have not managed to
>>>>>> maximise
>>>>>>>> the
>>>>>>>>> cluster count.
>>>>>>>>> 
>>>>>>>>> The question is: what do I need to tweak with regard to the
>>>>> available
>>>>>>>>> mahout
>>>>>>>>> settings, so the clusters are created as precisely as possible?
>>>>>>>>> 
>>>>>>>>> Many regards!
>>>>>>>>> Jens
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> View this message in context:
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>> http://lucene.472066.n3.nabble.com/What-are-the-best-settings-for-my-clustering-task-tp4092807.html
>>>>>>>>> Sent from the Mahout User List mailing list archive at
>>>> Nabble.com.
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 

Re: What are the best settings for my clustering task

Posted by Jens Bonerz <jb...@googlemail.com>.
Hmmm... has ballkmeans already made it into the 0.8 release? I can't find it
in the list of available programs when calling the mahout binary...


2013/10/3 Ted Dunning <te...@gmail.com>

> What you are seeing here are the cluster centroids themselves, not the
> cluster assignments.
>
> Streaming k-means is a single pass algorithm to derive these centroids.
>  Typically, the next step is to cluster these centroids using ball k-means.
>  *Those* results can then be applied back to the original (or new) input
> vectors to get cluster assignments for individual input vectors.
>
> I don't have command line specifics handy, but you seem to have done very
> well already at figuring out the details.
>
>
> On Oct 3, 2013, at 7:30 AM, Jens Bonerz wrote:
>
> > I created a series of scripts to try out streamingkmeans in mahout and
> > increased the number of clusters to a high amount as suggested by Ted.
> > Everything seems to work. However, I can't figure out how to access the
> > actual cluster data at the end of the process.
> >
> > It just gives me output that I cannot really understand... I would expect
> > my product_ids being referenced to cluster ids...
> >
> > Example of the procedure's output:
> >
> > hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running
> > locally
> > Input Path: file:MahoutCluster/part-r-00000
> > Key class: class org.apache.hadoop.io.IntWritable Value Class: class
> > org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable
> > Key: 0: Value: key = 8678, weight = 3.00, vector =
> >
> {37:26.83479118347168,6085:8.162049293518066,4785:10.44443130493164,2493:19.677349090576172,2494:16.06648826599121,9659:9.568963050842285,20877:9.307...
> > Key: 1: Value: key = 3118, weight = 14.00, vector =
> >
> {19457:5.646900812784831,8774:4.746263821919759,9738:1.022495985031128,13301:5.762300491333008,14947:0.6774413585662842,8787:6.841406504313151,14958...
> > Key: 2: Value: key = 2867, weight = 3.00, vector =
> >
> {15873:10.955257415771484,1615:4.029662132263184,20963:4.979445934295654,3978:5.611329555511475,7950:8.364990234375,8018:8.68657398223877,15433:7.959...
> > Key: 3: Value: key = 6295, weight = 1.00, vector =
> >
> {17113:10.955257415771484,15347:9.568963050842285,15348:10.955257415771484,19845:7.805374622344971,7945:10.262109756469727,15356:18.090286254882812,1...
> > Key: 4: Value: key = 6725, weight = 4.00, vector =
> >
> {10570:7.64715051651001,14915:6.126943588256836,14947:4.064648151397705,14330:9.414812088012695,18271:2.7172491550445557,14335:19.677349090576172,143...
> > Key: 5:......
> >
> >
> >
> > this is my recipe:
> >
> >
> --------------------------------------------------------------------------------------------------------
> > Step 1
> > Create a seqfile from my data with Python. It's the product_id (key) and
> the
> > short normalized description (value) that is written into the sequence
> file.
> >
> >
> >
> >
> --------------------------------------------------------------------------------------------------------
> > Step 2
> > create vectors from that data with the following command:
> >
> > mahout seq2sparse \
> >   -i productClusterSequenceData/productClusterSequenceData.seq \
> >   -o productClusterSequenceData/vectors \
> >
> >
> >
> >
> --------------------------------------------------------------------------------------------------------
> > Step 3
> > Cluster the vectors using streamingkmeans with this command:
> >
> > mahout streamingkmeans \
> > -i productClusterSequenceData/vectors/tfidf-vectors \
> > -o MahoutCluster \
> > --tempDir /tmp \
> > -ow -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
> > -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
> > -k 10000 -km 500000 \
> >
> >
> >
> >
> --------------------------------------------------------------------------------------------------------
> > Step 4
> > Export the streamingkmeans cluster data into a textfile (for an extract
> of
> > the result see above)
> >
> > mahout seqdumper \
> > -i MahoutCluster > similarProducts.txt
> >
> > What am I missing?
> >
> >
> >
> >
> >
> > 2013/10/3 Ted Dunning <te...@gmail.com>
> >
> >> Yes.  That will work.
> >>
> >> The sketch will then contain 10,000 x log N centroids.  If N = 10^9,
> log N
> >> \approx 30 so the sketch will have about 300,000 weighted centroids
> in
> >> it.  The final clustering will have to process these centroids to
> produce
> >> the desired 5,000 clusters.  Since 300,000 is a relatively small number
> of
> >> data points, this clustering step should proceed relatively quickly.
> >>
> >>
> >>
> >> On Wed, Oct 2, 2013 at 10:21 AM, Jens Bonerz <jb...@googlemail.com>
> >> wrote:
> >>
> >>> thx for your elaborate answer.
> >>>
> >>> so if the upper bound on the final number of clusters is unknown in the
> >>> beginning, what would happen, if I define a very high number that is
> >>> guaranteed to be > the estimated number of clusters.
> >>> for example if I set it to 10.000 clusters if an estimate of 5.000 is
> >>> likely, will that work?
> >>>
> >>>
> >>> 2013/10/2 Ted Dunning <te...@gmail.com>
> >>>
> >>>> The way that the new streaming k-means works is that there is a first
> >>>> sketch pass which only requires an upper bound on the final number of
> >>>> clusters you will want.  It adaptively creates more or less clusters
> >>>> depending on the data and your bound.  This sketch is guaranteed to be
> >>>> computed within at most one map-reduce pass.  There is a threaded
> >> version
> >>>> that runs (fast) on a single machine.  The threaded version is liable
> >> to
> >>> be
> >>>> faster than the map-reduce version for moderate or smaller data sizes.
> >>>>
> >>>> That sketch can then be used to do all kinds of things that rely on
> >>>> Euclidean distance and still get results within a small factor of the
> >>> same
> >>>> algorithm applied to all of the data.  Typically this second phase is
> a
> >>>> ball k-means algorithm, but it could easily be a dp-means algorithm
> [1]
> >>> if
> >>>> you want a variable number of clusters.  Indeed, you could run many
> >>>> dp-means passes with different values of lambda on the same sketch.
> >> Note
> >>>> that the sketch is small enough that in-memory clustering is entirely
> >>>> viable and is very fast.
> >>>>
> >>>> For the problem you describe, however, you probably don't need the
> >> sketch
> >>>> approach at all and can probably apply ball k-means or dp-means
> >> directly.
> >>>> Running many k-means clusterings with differing values of k should be
> >>>> entirely feasible as well with such data sizes.
> >>>>
> >>>> [1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Wed, Oct 2, 2013 at 9:11 AM, Jens Bonerz <jb...@googlemail.com>
> >>>> wrote:
> >>>>
> >>>>> Isn't the streaming k-means just a different approach to crunch
> >> through
> >>>> the
> >>>>> data? In other words, the result of streaming k-means should be
> >>>> comparable
> >>>>> to using k-means in multiple chained map reduce cycles?
> >>>>>
> >>>>> I just read a paper about the k-means clustering and its underlying
> >>>>> algorithm.
> >>>>>
> >>>>> According to that paper, k-means relies on a preknown/predefined
> >> amount
> >>>> of
> >>>>> clusters as an input parameter.
> >>>>>
> >>>>> Link: http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf
> >>>>>
> >>>>> In my current scenario however, the number of clusters is unknown at
> >>> the
> >>>>> beginning.
> >>>>>
> >>>>> Maybe k-means is just not the right algorithm for clustering similar
> >>>>> products based on their short description text? What else could I
> >> use?
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> 2013/10/1 Ted Dunning <te...@gmail.com>
> >>>>>
> >>>>>> At such small sizes, I would guess that the sequential version of
> >> the
> >>>>>> streaming k-means or ball k-means would be better options.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Mon, Sep 30, 2013 at 2:14 PM, mercutio7979 <
> >>> jbonerz@googlemail.com
> >>>>>>> wrote:
> >>>>>>
> >>>>>>> Hello all,
> >>>>>>>
> >>>>>>> I am currently trying to create clusters from a group of 50.000
> >>> strings
> >>>>> that
> >>>>>>> contain product descriptions (around 70-100 characters length
> >>> each).
> >>>>>>>
> >>>>>>> That group of 50.000 consists of roughly 5.000 individual
> >> products
> >>>> and
> >>>>>> ten
> >>>>>>> varying product descriptions per product. The product
> >> descriptions
> >>>> are
> >>>>>>> already prepared for clustering and contain a normalized brand
> >>> name,
> >>>>>>> product
> >>>>>>> model number, etc.
> >>>>>>>
> >>>>>>> What would be a good approach to maximise the amount of found
> >>>> clusters
> >>>>>> (the
> >>>>>>> best possible value would be 5.000 clusters with 10 products
> >> each)
> >>>>>>>
> >>>>>>> I adapted the reuters cluster script to read in my data and
> >> managed
> >>>> to
> >>>>>>> create a first set of clusters. However, I have not managed to
> >>>> maximise
> >>>>>> the
> >>>>>>> cluster count.
> >>>>>>>
> >>>>>>> The question is: what do I need to tweak with regard to the
> >>> available
> >>>>>>> mahout
> >>>>>>> settings, so the clusters are created as precisely as possible?
> >>>>>>>
> >>>>>>> Many regards!
> >>>>>>> Jens
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> View this message in context:
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> http://lucene.472066.n3.nabble.com/What-are-the-best-settings-for-my-clustering-task-tp4092807.html
> >>>>>>> Sent from the Mahout User List mailing list archive at
> >> Nabble.com.
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>>

Re: What are the best settings for my clustering task

Posted by Ted Dunning <te...@gmail.com>.
What you are seeing here are the cluster centroids themselves, not the cluster assignments.

Streaming k-means is a single-pass algorithm to derive these centroids.  Typically, the next step is to cluster these centroids using ball k-means.  *Those* results can then be applied back to the original (or new) input vectors to get cluster assignments for individual input vectors.

I don't have command line specifics handy, but you seem to have done very well already at figuring out the details.
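The two phases Ted describes can be sketched in a few lines of Python. This is only an illustration of the idea, not Mahout's actual Java implementation: the toy weighted k-means below stands in for Mahout's ball k-means, uses plain Euclidean distance, and all names are mine.

```python
import numpy as np

def ball_kmeans(centroids, weights, k, iters=20, seed=0):
    """Toy stand-in for ball k-means: cluster the weighted sketch
    centroids into k final clusters (Euclidean distance)."""
    rng = np.random.default_rng(seed)
    centers = centroids[rng.choice(len(centroids), size=k, replace=False)]
    for _ in range(iters):
        # assign each sketch centroid to its nearest final center
        d = ((centroids[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # recompute each center as the weighted mean of its members
        for j in range(k):
            members = labels == j
            if members.any():
                centers[j] = np.average(centroids[members], axis=0,
                                        weights=weights[members])
    return centers

def assign(vectors, centers):
    """Phase two: map each original input vector to a final cluster id."""
    d = ((vectors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

# two obvious groups of weighted sketch centroids
cents = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
w = np.array([3.0, 2.0, 4.0, 1.0])
centers = ball_kmeans(cents, w, k=2)
labels = assign(np.array([[0.05, 0.0], [5.05, 5.0]]), centers)
```

The point is the division of labor: the sketch pass compresses the data into weighted centroids, and the final clustering plus assignment both run over that small set.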


On Oct 3, 2013, at 7:30 AM, Jens Bonerz wrote:

> I created a series of scripts to try out streamingkmeans in mahout an
> increased the number of clusters to a high amount as suggested by Ted.
> Everything seems to work. However, I can't figure out how to access the
> actual cluster data at the end of the process.
> 
> It just gives me output that I cannot really understand... I would expect
> my product_ids being referenced to cluster ids...
> 
> Example of the procedure's output:
> 
> hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running
> locally
> Input Path: file:MahoutCluster/part-r-00000
> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
> org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable
> Key: 0: Value: key = 8678, weight = 3.00, vector =
> {37:26.83479118347168,6085:8.162049293518066,4785:10.44443130493164,2493:19.677349090576172,2494:16.06648826599121,9659:9.568963050842285,20877:9.307...
> Key: 1: Value: key = 3118, weight = 14.00, vector =
> {19457:5.646900812784831,8774:4.746263821919759,9738:1.022495985031128,13301:5.762300491333008,14947:0.6774413585662842,8787:6.841406504313151,14958...
> Key: 2: Value: key = 2867, weight = 3.00, vector =
> {15873:10.955257415771484,1615:4.029662132263184,20963:4.979445934295654,3978:5.611329555511475,7950:8.364990234375,8018:8.68657398223877,15433:7.959...
> Key: 3: Value: key = 6295, weight = 1.00, vector =
> {17113:10.955257415771484,15347:9.568963050842285,15348:10.955257415771484,19845:7.805374622344971,7945:10.262109756469727,15356:18.090286254882812,1...
> Key: 4: Value: key = 6725, weight = 4.00, vector =
> {10570:7.64715051651001,14915:6.126943588256836,14947:4.064648151397705,14330:9.414812088012695,18271:2.7172491550445557,14335:19.677349090576172,143...
> Key: 5:......
> 
> 
> 
> this is my recipe:
> 
> --------------------------------------------------------------------------------------------------------
> Step 1
> Create a seqfile from my data with Python. Its the product_id (key) and the
> short normalized descripti (value) that is written into the sequence file.
> 
> 
> 
> --------------------------------------------------------------------------------------------------------
> Step 2
> create vectors from that data with the following command:
> 
> mahout seq2sparse \
>   -i productClusterSequenceData/productClusterSequenceData.seq \
>   -o productClusterSequenceData/vectors \
> 
> 
> 
> --------------------------------------------------------------------------------------------------------
> Step 3
> Cluster the vectors using streamingkeans with this command:
> 
> mahout streamingkmeans \
> -i productClusterSequenceData/vectors/tfidf-vectors \
> -o MahoutCluster \
> --tempDir /tmp \
> -ow -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
> -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
> -k 10000 -km 500000 \
> 
> 
> 
> --------------------------------------------------------------------------------------------------------
> Step 4
> Export the streamingkmeans cluster data into a textfile (for an extract of
> the result see above)
> 
> mahout seqdumper \
> -i MahoutCluster > similarProducts.txt
> 
> What am I missing?
> 
> 
> 
> 
> 
> 2013/10/3 Ted Dunning <te...@gmail.com>
> 
>> Yes.  That will work.
>> 
>> The sketch will then contain 10,000 x log N centroids.  If N = 10^9, log N
>> \approx 30 so the sketch will have at about 300,000 weighted centroids in
>> it.  The final clustering will have to process these centroids to produce
>> the desired 5,000 clusters.  Since 300,000 is a relatively small number of
>> data points, this clustering step should proceed relatively quickly.
>> 
>> 
>> 
>> On Wed, Oct 2, 2013 at 10:21 AM, Jens Bonerz <jb...@googlemail.com>
>> wrote:
>> 
>>> thx for your elaborate answer.
>>> 
>>> so if the upper bound on the final number of clusters is unknown in the
>>> beginning, what would happen, if I define a very high number that is
>>> guaranteed to be > the estimated number of clusters.
>>> for example if I set it to 10.000 clusters if an estimate of 5.000 is
>>> likely, will that work?
>>> 
>>> 
>>> 2013/10/2 Ted Dunning <te...@gmail.com>
>>> 
>>>> The way that the new streaming k-means works is that there is a first
>>>> sketch pass which only requires an upper bound on the final number of
>>>> clusters you will want.  It adaptively creates more or less clusters
>>>> depending on the data and your bound.  This sketch is guaranteed to be
>>>> computed within at most one map-reduce pass.  There is a threaded
>> version
>>>> that runs (fast) on a single machine.  The threaded version is liable
>> to
>>> be
>>>> faster than the map-reduce version for moderate or smaller data sizes.
>>>> 
>>>> That sketch can then be used to do all kinds of things that rely on
>>>> Euclidean distance and still get results within a small factor of the
>>> same
>>>> algorithm applied to all of the data.  Typically this second phase is a
>>>> ball k-means algorithm, but it could easily be a dp-means algorithm [1]
>>> if
>>>> you want a variable number of clusters.  Indeed, you could run many
>>>> dp-means passes with different values of lambda on the same sketch.
>> Note
>>>> that the sketch is small enough that in-memory clustering is entirely
>>>> viable and is very fast.
>>>> 
>>>> For the problem you describe, however, you probably don't need the
>> sketch
>>>> approach at all and can probably apply ball k-means or dp-means
>> directly.
>>>> Running many k-means clusterings with differing values of k should be
>>>> entirely feasible as well with such data sizes.
>>>> 
>>>> [1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Wed, Oct 2, 2013 at 9:11 AM, Jens Bonerz <jb...@googlemail.com>
>>>> wrote:
>>>> 
>>>>> Isn't the streaming k-means just a different approach to crunch
>> through
>>>> the
>>>>> data? In other words, the result of streaming k-means should be
>>>> comparable
>>>>> to using k-means in multiple chained map reduce cycles?
>>>>> 
>>>>> I just read a paper about the k-means clustering and its underlying
>>>>> algorithm.
>>>>> 
>>>>> According to that paper, k-means relies on a preknown/predefined
>> amount
>>>> of
>>>>> clusters as an input parameter.
>>>>> 
>>>>> Link: http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf
>>>>> 
>>>>> In my current scenario however, the number of clusters is unknown at
>>> the
>>>>> beginning.
>>>>> 
>>>>> Maybe k-means is just not the right algorithm for clustering similar
>>>>> products based on their short description text? What else could I
>> use?
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 2013/10/1 Ted Dunning <te...@gmail.com>
>>>>> 
>>>>>> At such small sizes, I would guess that the sequential version of
>> the
>>>>>> streaming k-means or ball k-means would be better options.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Mon, Sep 30, 2013 at 2:14 PM, mercutio7979 <
>>> jbonerz@googlemail.com
>>>>>>> wrote:
>>>>>> 
>>>>>>> Hello all,
>>>>>>> 
>>>>>>> I am currently trying create clusters from a group of 50.000
>>> strings
>>>>> that
>>>>>>> contain product descriptions (around 70-100 characters length
>>> each).
>>>>>>> 
>>>>>>> That group of 50.000 consists of roughly 5.000 individual
>> products
>>>> and
>>>>>> ten
>>>>>>> varying product descriptions per product. The product
>> descriptions
>>>> are
>>>>>>> already prepared for clustering and contain a normalized brand
>>> name,
>>>>>>> product
>>>>>>> model number, etc.
>>>>>>> 
>>>>>>> What would be a good approach to maximise the amound of found
>>>> clusters
>>>>>> (the
>>>>>>> best possible value would be 5.000 clusters with 10 products
>> each)
>>>>>>> 
>>>>>>> I adapted the reuters cluster script to read in my data and
>> managed
>>>> to
>>>>>>> create a first set of clusters. However, I have not managed to
>>>> maximise
>>>>>> the
>>>>>>> cluster count.
>>>>>>> 
>>>>>>> The question is: what do I need to tweak with regard to the
>>> available
>>>>>>> mahout
>>>>>>> settings, so the clusters are created as precisely as possible?
>>>>>>> 
>>>>>>> Many regards!
>>>>>>> Jens
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> View this message in context:
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> http://lucene.472066.n3.nabble.com/What-are-the-best-settings-for-my-clustering-task-tp4092807.html
>>>>>>> Sent from the Mahout User List mailing list archive at
>> Nabble.com.
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> CEO
>>> Hightech Marketing Group
>>> Cell/Mobile: +49 173 539 3588
>>> 
>>> ____
>>> 
>>> Hightech Marketing Group
>>> Frankenstraße 32
>>> 50354 Huerth
>>> Germany
>>> Phone: +49 (0)2233 – 619 2741
>>> Fax: +49 (0)2233 – 619 27419
>>> Web: www.hightechmg.com
>>> 
>> 
> 
> 
> 
> -- 
> CEO
> Hightech Marketing Group
> Cell/Mobile: +49 173 539 3588
> 
> ____
> 
> Hightech Marketing Group
> Frankenstraße 32
> 50354 Huerth
> Germany
> Phone: +49 (0)2233 – 619 2741
> Fax: +49 (0)2233 – 619 27419
> Web: www.hightechmg.com


Re: What are the best settings for my clustering task

Posted by Jens Bonerz <jb...@googlemail.com>.
I created a series of scripts to try out streamingkmeans in Mahout and
increased the number of clusters to a high value, as suggested by Ted.
Everything seems to work. However, I can't figure out how to access the
actual cluster data at the end of the process.

It just gives me output that I cannot really understand... I would expect
my product_ids to be mapped to cluster ids...

Example of the procedure's output:

hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running
locally
Input Path: file:MahoutCluster/part-r-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class
org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable
Key: 0: Value: key = 8678, weight = 3.00, vector =
{37:26.83479118347168,6085:8.162049293518066,4785:10.44443130493164,2493:19.677349090576172,2494:16.06648826599121,9659:9.568963050842285,20877:9.307...
Key: 1: Value: key = 3118, weight = 14.00, vector =
{19457:5.646900812784831,8774:4.746263821919759,9738:1.022495985031128,13301:5.762300491333008,14947:0.6774413585662842,8787:6.841406504313151,14958...
Key: 2: Value: key = 2867, weight = 3.00, vector =
{15873:10.955257415771484,1615:4.029662132263184,20963:4.979445934295654,3978:5.611329555511475,7950:8.364990234375,8018:8.68657398223877,15433:7.959...
Key: 3: Value: key = 6295, weight = 1.00, vector =
{17113:10.955257415771484,15347:9.568963050842285,15348:10.955257415771484,19845:7.805374622344971,7945:10.262109756469727,15356:18.090286254882812,1...
Key: 4: Value: key = 6725, weight = 4.00, vector =
{10570:7.64715051651001,14915:6.126943588256836,14947:4.064648151397705,14330:9.414812088012695,18271:2.7172491550445557,14335:19.677349090576172,143...
Key: 5:......
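(For what it's worth, each dumped line appears to carry a centroid id, a weight, and a sparse tf-idf vector. A rough parser, assuming the `key = K, weight = W, vector = {index:value,...}` layout shown above and a complete, untruncated vector on one line:)

```python
import re

def parse_centroid(line):
    """Rough parser for one seqdumper CentroidWritable line, assuming
    the 'key = K, weight = W, vector = {i:v,...}' layout shown above."""
    m = re.search(r"key = (\d+), weight = ([\d.]+), vector =\s*\{([^}]*)\}",
                  line)
    key, weight, body = int(m.group(1)), float(m.group(2)), m.group(3)
    # sparse vector: term-index -> tf-idf weight
    vector = {int(i): float(v) for i, v in
              (pair.split(":") for pair in body.split(",") if pair)}
    return key, weight, vector

line = "Key: 0: Value: key = 8678, weight = 3.00, vector = {37:26.83,6085:8.16}"
key, weight, vec = parse_centroid(line)
```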



this is my recipe:

--------------------------------------------------------------------------------------------------------
Step 1
Create a seqfile from my data with Python. It's the product_id (key) and the
short normalized description (value) that are written into the sequence file.



--------------------------------------------------------------------------------------------------------
Step 2
create vectors from that data with the following command:

mahout seq2sparse \
   -i productClusterSequenceData/productClusterSequenceData.seq \
   -o productClusterSequenceData/vectors \



--------------------------------------------------------------------------------------------------------
Step 3
Cluster the vectors using streamingkmeans with this command:

mahout streamingkmeans \
-i productClusterSequenceData/vectors/tfidf-vectors \
-o MahoutCluster \
--tempDir /tmp \
-ow -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
-sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
-k 10000 -km 500000 \



--------------------------------------------------------------------------------------------------------
Step 4
Export the streamingkmeans cluster data into a textfile (for an extract of
the result see above)

mahout seqdumper \
-i MahoutCluster > similarProducts.txt

What am I missing?
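(A sketch of the missing last step: the dump above holds cluster *centers*, not assignments, so a product_id -> cluster_id map needs a nearest-centroid pass over the original tf-idf vectors. Illustrative Python only -- it assumes the vectors and centroids have already been loaded into dense numpy arrays, which in practice means reading the sequence files outside Mahout; cosine distance is used to match the CosineDistanceMeasure above.)

```python
import numpy as np

def cosine_assign(product_vecs, centroids):
    """Return, for each product vector, the index of the nearest
    centroid under cosine distance (highest cosine similarity)."""
    def normalize(m):
        return m / np.linalg.norm(m, axis=1, keepdims=True)
    sims = normalize(product_vecs) @ normalize(centroids).T
    return sims.argmax(axis=1)

# toy tf-idf-like vectors: two products near centroid 0, one near centroid 1
vecs = np.array([[1.0, 0.1, 0.0],
                 [0.9, 0.2, 0.0],
                 [0.0, 0.1, 1.0]])
cents = np.array([[1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])
cluster_ids = cosine_assign(vecs, cents)  # one cluster id per product
```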





2013/10/3 Ted Dunning <te...@gmail.com>

> Yes.  That will work.
>
> The sketch will then contain 10,000 x log N centroids.  If N = 10^9, log N
> \approx 30 so the sketch will have at about 300,000 weighted centroids in
> it.  The final clustering will have to process these centroids to produce
> the desired 5,000 clusters.  Since 300,000 is a relatively small number of
> data points, this clustering step should proceed relatively quickly.
>
>
>
> On Wed, Oct 2, 2013 at 10:21 AM, Jens Bonerz <jb...@googlemail.com>
> wrote:
>
> > thx for your elaborate answer.
> >
> > so if the upper bound on the final number of clusters is unknown in the
> > beginning, what would happen, if I define a very high number that is
> > guaranteed to be > the estimated number of clusters.
> > for example if I set it to 10.000 clusters if an estimate of 5.000 is
> > likely, will that work?
> >
> >
> > 2013/10/2 Ted Dunning <te...@gmail.com>
> >
> > > The way that the new streaming k-means works is that there is a first
> > > sketch pass which only requires an upper bound on the final number of
> > > clusters you will want.  It adaptively creates more or less clusters
> > > depending on the data and your bound.  This sketch is guaranteed to be
> > > computed within at most one map-reduce pass.  There is a threaded
> version
> > > that runs (fast) on a single machine.  The threaded version is liable
> to
> > be
> > > faster than the map-reduce version for moderate or smaller data sizes.
> > >
> > > That sketch can then be used to do all kinds of things that rely on
> > > Euclidean distance and still get results within a small factor of the
> > same
> > > algorithm applied to all of the data.  Typically this second phase is a
> > > ball k-means algorithm, but it could easily be a dp-means algorithm [1]
> > if
> > > you want a variable number of clusters.  Indeed, you could run many
> > > dp-means passes with different values of lambda on the same sketch.
>  Note
> > > that the sketch is small enough that in-memory clustering is entirely
> > > viable and is very fast.
> > >
> > > For the problem you describe, however, you probably don't need the
> sketch
> > > approach at all and can probably apply ball k-means or dp-means
> directly.
> > >  Running many k-means clusterings with differing values of k should be
> > > entirely feasible as well with such data sizes.
> > >
> > > [1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
> > >
> > >
> > >
> > >
> > > On Wed, Oct 2, 2013 at 9:11 AM, Jens Bonerz <jb...@googlemail.com>
> > > wrote:
> > >
> > > > Isn't the streaming k-means just a different approach to crunch
> through
> > > the
> > > > data? In other words, the result of streaming k-means should be
> > > comparable
> > > > to using k-means in multiple chained map reduce cycles?
> > > >
> > > > I just read a paper about the k-means clustering and its underlying
> > > > algorithm.
> > > >
> > > > According to that paper, k-means relies on a preknown/predefined
> amount
> > > of
> > > > clusters as an input parameter.
> > > >
> > > > Link: http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf
> > > >
> > > > In my current scenario however, the number of clusters is unknown at
> > the
> > > > beginning.
> > > >
> > > > Maybe k-means is just not the right algorithm for clustering similar
> > > > products based on their short description text? What else could I
> use?
> > > >
> > > >
> > > >
> > > >
> > > > 2013/10/1 Ted Dunning <te...@gmail.com>
> > > >
> > > > > At such small sizes, I would guess that the sequential version of
> the
> > > > > streaming k-means or ball k-means would be better options.
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Sep 30, 2013 at 2:14 PM, mercutio7979 <
> > jbonerz@googlemail.com
> > > > > >wrote:
> > > > >
> > > > > > Hello all,
> > > > > >
> > > > > > I am currently trying create clusters from a group of 50.000
> > strings
> > > > that
> > > > > > contain product descriptions (around 70-100 characters length
> > each).
> > > > > >
> > > > > > That group of 50.000 consists of roughly 5.000 individual
> products
> > > and
> > > > > ten
> > > > > > varying product descriptions per product. The product
> descriptions
> > > are
> > > > > > already prepared for clustering and contain a normalized brand
> > name,
> > > > > > product
> > > > > > model number, etc.
> > > > > >
> > > > > > What would be a good approach to maximise the amound of found
> > > clusters
> > > > > (the
> > > > > > best possible value would be 5.000 clusters with 10 products
> each)
> > > > > >
> > > > > > I adapted the reuters cluster script to read in my data and
> managed
> > > to
> > > > > > create a first set of clusters. However, I have not managed to
> > > maximise
> > > > > the
> > > > > > cluster count.
> > > > > >
> > > > > > The question is: what do I need to tweak with regard to the
> > available
> > > > > > mahout
> > > > > > settings, so the clusters are created as precisely as possible?
> > > > > >
> > > > > > Many regards!
> > > > > > Jens
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > View this message in context:
> > > > > >
> > > > >
> > > >
> > >
> >
> http://lucene.472066.n3.nabble.com/What-are-the-best-settings-for-my-clustering-task-tp4092807.html
> > > > > > Sent from the Mahout User List mailing list archive at
> Nabble.com.
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > CEO
> > Hightech Marketing Group
> > Cell/Mobile: +49 173 539 3588
> >
> > ____
> >
> > Hightech Marketing Group
> > Frankenstraße 32
> > 50354 Huerth
> > Germany
> > Phone: +49 (0)2233 – 619 2741
> > Fax: +49 (0)2233 – 619 27419
> > Web: www.hightechmg.com
> >
>



-- 
CEO
Hightech Marketing Group
Cell/Mobile: +49 173 539 3588

____

Hightech Marketing Group
Frankenstraße 32
50354 Huerth
Germany
Phone: +49 (0)2233 – 619 2741
Fax: +49 (0)2233 – 619 27419
Web: www.hightechmg.com

Re: What are the best settings for my clustering task

Posted by Jens Bonerz <jb...@googlemail.com>.
I will try that right away. Thanks, Ted!


2013/10/3 Ted Dunning <te...@gmail.com>

> Yes.  That will work.
>
> The sketch will then contain 10,000 x log N centroids.  If N = 10^9, log N
> \approx 30 so the sketch will have at about 300,000 weighted centroids in
> it.  The final clustering will have to process these centroids to produce
> the desired 5,000 clusters.  Since 300,000 is a relatively small number of
> data points, this clustering step should proceed relatively quickly.
>
>
>
> On Wed, Oct 2, 2013 at 10:21 AM, Jens Bonerz <jb...@googlemail.com>
> wrote:
>
> > thx for your elaborate answer.
> >
> > so if the upper bound on the final number of clusters is unknown in the
> > beginning, what would happen, if I define a very high number that is
> > guaranteed to be > the estimated number of clusters.
> > for example if I set it to 10.000 clusters if an estimate of 5.000 is
> > likely, will that work?
> >
> >
> > 2013/10/2 Ted Dunning <te...@gmail.com>
> >
> > > The way that the new streaming k-means works is that there is a first
> > > sketch pass which only requires an upper bound on the final number of
> > > clusters you will want.  It adaptively creates more or less clusters
> > > depending on the data and your bound.  This sketch is guaranteed to be
> > > computed within at most one map-reduce pass.  There is a threaded
> version
> > > that runs (fast) on a single machine.  The threaded version is liable
> to
> > be
> > > faster than the map-reduce version for moderate or smaller data sizes.
> > >
> > > That sketch can then be used to do all kinds of things that rely on
> > > Euclidean distance and still get results within a small factor of the
> > same
> > > algorithm applied to all of the data.  Typically this second phase is a
> > > ball k-means algorithm, but it could easily be a dp-means algorithm [1]
> > if
> > > you want a variable number of clusters.  Indeed, you could run many
> > > dp-means passes with different values of lambda on the same sketch.
>  Note
> > > that the sketch is small enough that in-memory clustering is entirely
> > > viable and is very fast.
> > >
> > > For the problem you describe, however, you probably don't need the
> sketch
> > > approach at all and can probably apply ball k-means or dp-means
> directly.
> > >  Running many k-means clusterings with differing values of k should be
> > > entirely feasible as well with such data sizes.
> > >
> > > [1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
> > >
> > >
> > >
> > >
> > > On Wed, Oct 2, 2013 at 9:11 AM, Jens Bonerz <jb...@googlemail.com>
> > > wrote:
> > >
> > > > Isn't the streaming k-means just a different approach to crunch
> through
> > > the
> > > > data? In other words, the result of streaming k-means should be
> > > comparable
> > > > to using k-means in multiple chained map reduce cycles?
> > > >
> > > > I just read a paper about the k-means clustering and its underlying
> > > > algorithm.
> > > >
> > > > According to that paper, k-means relies on a preknown/predefined
> amount
> > > of
> > > > clusters as an input parameter.
> > > >
> > > > Link: http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf
> > > >
> > > > In my current scenario however, the number of clusters is unknown at
> > the
> > > > beginning.
> > > >
> > > > Maybe k-means is just not the right algorithm for clustering similar
> > > > products based on their short description text? What else could I
> use?
> > > >
> > > >
> > > >
> > > >
> > > > 2013/10/1 Ted Dunning <te...@gmail.com>
> > > >
> > > > > At such small sizes, I would guess that the sequential version of
> the
> > > > > streaming k-means or ball k-means would be better options.
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Sep 30, 2013 at 2:14 PM, mercutio7979 <
> > jbonerz@googlemail.com
> > > > > >wrote:
> > > > >
> > > > > > Hello all,
> > > > > >
> > > > > > I am currently trying create clusters from a group of 50.000
> > strings
> > > > that
> > > > > > contain product descriptions (around 70-100 characters length
> > each).
> > > > > >
> > > > > > That group of 50.000 consists of roughly 5.000 individual
> products
> > > and
> > > > > ten
> > > > > > varying product descriptions per product. The product
> descriptions
> > > are
> > > > > > already prepared for clustering and contain a normalized brand
> > name,
> > > > > > product
> > > > > > model number, etc.
> > > > > >
> > > > > > What would be a good approach to maximise the amound of found
> > > clusters
> > > > > (the
> > > > > > best possible value would be 5.000 clusters with 10 products
> each)
> > > > > >
> > > > > > I adapted the reuters cluster script to read in my data and
> managed
> > > to
> > > > > > create a first set of clusters. However, I have not managed to
> > > maximise
> > > > > the
> > > > > > cluster count.
> > > > > >
> > > > > > The question is: what do I need to tweak with regard to the
> > available
> > > > > > mahout
> > > > > > settings, so the clusters are created as precisely as possible?
> > > > > >
> > > > > > Many regards!
> > > > > > Jens
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > View this message in context:
> > > > > >
> > > > >
> > > >
> > >
> >
> http://lucene.472066.n3.nabble.com/What-are-the-best-settings-for-my-clustering-task-tp4092807.html
> > > > > > Sent from the Mahout User List mailing list archive at
> Nabble.com.
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > CEO
> > Hightech Marketing Group
> > Cell/Mobile: +49 173 539 3588
> >
> > ____
> >
> > Hightech Marketing Group
> > Frankenstraße 32
> > 50354 Huerth
> > Germany
> > Phone: +49 (0)2233 – 619 2741
> > Fax: +49 (0)2233 – 619 27419
> > Web: www.hightechmg.com
> >
>



-- 
CEO
Hightech Marketing Group
Cell/Mobile: +49 173 539 3588

____

Hightech Marketing Group
Frankenstraße 32
50354 Huerth
Germany
Phone: +49 (0)2233 – 619 2741
Fax: +49 (0)2233 – 619 27419
Web: www.hightechmg.com

Re: What are the best settings for my clustering task

Posted by Ted Dunning <te...@gmail.com>.
Yes.  That will work.

The sketch will then contain 10,000 x log N centroids.  If N = 10^9, log N
\approx 30 so the sketch will have about 300,000 weighted centroids in
it.  The final clustering will have to process these centroids to produce
the desired 5,000 clusters.  Since 300,000 is a relatively small number of
data points, this clustering step should proceed relatively quickly.
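Ted's arithmetic checks out in a couple of lines (taking log base 2, an assumption consistent with log 10^9 being roughly 30):

```python
import math

k = 10_000       # upper bound passed to the sketch pass
n = 10 ** 9      # number of input points in Ted's example
sketch_size = k * math.log2(n)
# log2(1e9) is about 29.9, so the sketch holds roughly 300,000
# weighted centroids -- small enough to cluster in memory
```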



On Wed, Oct 2, 2013 at 10:21 AM, Jens Bonerz <jb...@googlemail.com> wrote:

> thx for your elaborate answer.
>
> so if the upper bound on the final number of clusters is unknown in the
> beginning, what would happen, if I define a very high number that is
> guaranteed to be > the estimated number of clusters.
> for example if I set it to 10.000 clusters if an estimate of 5.000 is
> likely, will that work?
>
>
> 2013/10/2 Ted Dunning <te...@gmail.com>
>
> > The way that the new streaming k-means works is that there is a first
> > sketch pass which only requires an upper bound on the final number of
> > clusters you will want.  It adaptively creates more or less clusters
> > depending on the data and your bound.  This sketch is guaranteed to be
> > computed within at most one map-reduce pass.  There is a threaded version
> > that runs (fast) on a single machine.  The threaded version is liable to
> be
> > faster than the map-reduce version for moderate or smaller data sizes.
> >
> > That sketch can then be used to do all kinds of things that rely on
> > Euclidean distance and still get results within a small factor of the
> same
> > algorithm applied to all of the data.  Typically this second phase is a
> > ball k-means algorithm, but it could easily be a dp-means algorithm [1]
> if
> > you want a variable number of clusters.  Indeed, you could run many
> > dp-means passes with different values of lambda on the same sketch.  Note
> > that the sketch is small enough that in-memory clustering is entirely
> > viable and is very fast.
> >
> > For the problem you describe, however, you probably don't need the sketch
> > approach at all and can probably apply ball k-means or dp-means directly.
> >  Running many k-means clusterings with differing values of k should be
> > entirely feasible as well with such data sizes.
> >
> > [1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
> >
> >
> >
> >
> > On Wed, Oct 2, 2013 at 9:11 AM, Jens Bonerz <jb...@googlemail.com>
> > wrote:
> >
> > > Isn't the streaming k-means just a different approach to crunch through
> > the
> > > data? In other words, the result of streaming k-means should be
> > comparable
> > > to using k-means in multiple chained map reduce cycles?
> > >
> > > I just read a paper about the k-means clustering and its underlying
> > > algorithm.
> > >
> > > According to that paper, k-means relies on a preknown/predefined amount
> > of
> > > clusters as an input parameter.
> > >
> > > Link: http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf
> > >
> > > In my current scenario however, the number of clusters is unknown at
> the
> > > beginning.
> > >
> > > Maybe k-means is just not the right algorithm for clustering similar
> > > products based on their short description text? What else could I use?
> > >
> > >
> > >
> > >
> > > 2013/10/1 Ted Dunning <te...@gmail.com>
> > >
> > > > At such small sizes, I would guess that the sequential version of the
> > > > streaming k-means or ball k-means would be better options.
> > > >
> > > >
> > > >
> > > > On Mon, Sep 30, 2013 at 2:14 PM, mercutio7979 <
> jbonerz@googlemail.com
> > > > >wrote:
> > > >
> > > > > Hello all,
> > > > >
> > > > > I am currently trying create clusters from a group of 50.000
> strings
> > > that
> > > > > contain product descriptions (around 70-100 characters length
> each).
> > > > >
> > > > > That group of 50.000 consists of roughly 5.000 individual products
> > and
> > > > ten
> > > > > varying product descriptions per product. The product descriptions
> > are
> > > > > already prepared for clustering and contain a normalized brand
> name,
> > > > > product
> > > > > model number, etc.
> > > > >
> > > > > What would be a good approach to maximise the amound of found
> > clusters
> > > > (the
> > > > > best possible value would be 5.000 clusters with 10 products each)
> > > > >
> > > > > I adapted the reuters cluster script to read in my data and managed
> > to
> > > > > create a first set of clusters. However, I have not managed to
> > maximise
> > > > the
> > > > > cluster count.
> > > > >
> > > > > The question is: what do I need to tweak with regard to the
> available
> > > > > mahout
> > > > > settings, so the clusters are created as precisely as possible?
> > > > >
> > > > > Many regards!
> > > > > Jens
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > View this message in context:
> > > > >
> > > >
> > >
> >
> http://lucene.472066.n3.nabble.com/What-are-the-best-settings-for-my-clustering-task-tp4092807.html
> > > > > Sent from the Mahout User List mailing list archive at Nabble.com.
> > > > >
> > > >
> > >
> >
>
>
>
> --
> CEO
> Hightech Marketing Group
> Cell/Mobile: +49 173 539 3588
>
> ____
>
> Hightech Marketing Group
> Frankenstraße 32
> 50354 Huerth
> Germany
> Phone: +49 (0)2233 – 619 2741
> Fax: +49 (0)2233 – 619 27419
> Web: www.hightechmg.com
>

Re: What are the best settings for my clustering task

Posted by Jens Bonerz <jb...@googlemail.com>.
Thx for your elaborate answer.

So if the upper bound on the final number of clusters is unknown at the
beginning, what would happen if I define a very high number that is
guaranteed to be greater than the estimated number of clusters?
For example, if I set it to 10,000 clusters when an estimate of 5,000 is
likely, will that work?


2013/10/2 Ted Dunning <te...@gmail.com>

> The way that the new streaming k-means works is that there is a first
> sketch pass which only requires an upper bound on the final number of
> clusters you will want.  It adaptively creates more or less clusters
> depending on the data and your bound.  This sketch is guaranteed to be
> computed within at most one map-reduce pass.  There is a threaded version
> that runs (fast) on a single machine.  The threaded version is liable to be
> faster than the map-reduce version for moderate or smaller data sizes.
>
> That sketch can then be used to do all kinds of things that rely on
> Euclidean distance and still get results within a small factor of the same
> algorithm applied to all of the data.  Typically this second phase is a
> ball k-means algorithm, but it could easily be a dp-means algorithm [1] if
> you want a variable number of clusters.  Indeed, you could run many
> dp-means passes with different values of lambda on the same sketch.  Note
> that the sketch is small enough that in-memory clustering is entirely
> viable and is very fast.
>
> For the problem you describe, however, you probably don't need the sketch
> approach at all and can probably apply ball k-means or dp-means directly.
>  Running many k-means clusterings with differing values of k should be
> entirely feasible as well with such data sizes.
>
> [1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf



-- 
CEO
Hightech Marketing Group
Cell/Mobile: +49 173 539 3588

____

Hightech Marketing Group
Frankenstraße 32
50354 Huerth
Germany
Phone: +49 (0)2233 – 619 2741
Fax: +49 (0)2233 – 619 27419
Web: www.hightechmg.com

Re: What are the best settings for my clustering task

Posted by Ted Dunning <te...@gmail.com>.
The way that the new streaming k-means works is that there is a first
sketch pass which only requires an upper bound on the final number of
clusters you will want.  It adaptively creates more or fewer clusters
depending on the data and your bound.  This sketch is guaranteed to be
computed within at most one map-reduce pass.  There is a threaded version
that runs (fast) on a single machine.  The threaded version is liable to be
faster than the map-reduce version for moderate or smaller data sizes.

That sketch can then be used to do all kinds of things that rely on
Euclidean distance and still get results within a small factor of the same
algorithm applied to all of the data.  Typically this second phase is a
ball k-means algorithm, but it could easily be a dp-means algorithm [1] if
you want a variable number of clusters.  Indeed, you could run many
dp-means passes with different values of lambda on the same sketch.  Note
that the sketch is small enough that in-memory clustering is entirely
viable and is very fast.

For the problem you describe, however, you probably don't need the sketch
approach at all and can probably apply ball k-means or dp-means directly.
Running many k-means clusterings with differing values of k should be
entirely feasible as well with such data sizes.

[1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
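
To make the dp-means idea concrete, here is a toy sketch in Python. This
is illustrative only, not the Mahout implementation: the data, the lambda
value, and the function name are made up for the example.

```python
import numpy as np

def dp_means(points, lam, n_iters=10):
    """Toy dp-means sketch (Kulis & Jordan, ICML 2012).

    lam plays the role that k plays in k-means: a point farther than
    lam from every existing centroid seeds a new cluster, so the
    cluster count emerges from the data instead of being fixed up front.
    """
    centroids = [points[0].copy()]
    assignments = [0] * len(points)
    for _ in range(n_iters):
        # Assignment step: nearest centroid, or a brand-new cluster
        # if even the nearest one is farther away than lam.
        assignments = []
        for p in points:
            dists = [np.linalg.norm(p - c) for c in centroids]
            j = int(np.argmin(dists))
            if dists[j] > lam:
                centroids.append(p.copy())
                j = len(centroids) - 1
            assignments.append(j)
        # Update step: drop empty clusters, recompute cluster means.
        keep = sorted(set(assignments))
        remap = {old: new for new, old in enumerate(keep)}
        assignments = [remap[a] for a in assignments]
        labels = np.array(assignments)
        centroids = [points[labels == j].mean(axis=0)
                     for j in range(len(keep))]
    return centroids, assignments

# Two well-separated blobs: with lam between the blob radius (~0.3)
# and the blob separation (~7), dp-means should find 2 clusters.
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
                 rng.normal(5.0, 0.1, (20, 2))])
centroids, assignments = dp_means(pts, lam=2.0)
print(len(centroids))  # 2 for this data
```

Note how a single lambda trades cluster granularity against count, which
is why running several dp-means passes with different lambdas over one
small in-memory sketch is cheap.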




On Wed, Oct 2, 2013 at 9:11 AM, Jens Bonerz <jb...@googlemail.com> wrote:

> Isn't the streaming k-means just a different approach to crunch through the
> data? In other words, the result of streaming k-means should be comparable
> to using k-means in multiple chained map reduce cycles?
>
> I just read a paper about the k-means clustering and its underlying
> algorithm.
>
> According to that paper, k-means relies on a predefined number of clusters
> as an input parameter.
>
> Link: http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf
>
> In my current scenario however, the number of clusters is unknown at the
> beginning.
>
> Maybe k-means is just not the right algorithm for clustering similar
> products based on their short description text? What else could I use?

Re: What are the best settings for my clustering task

Posted by Jens Bonerz <jb...@googlemail.com>.
Isn't the streaming k-means just a different approach to crunch through the
data? In other words, the result of streaming k-means should be comparable
to using k-means in multiple chained map reduce cycles?

I just read a paper about the k-means clustering and its underlying
algorithm.

According to that paper, k-means relies on a predefined number of clusters
as an input parameter.

Link: http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf

In my current scenario however, the number of clusters is unknown at the
beginning.

Maybe k-means is just not the right algorithm for clustering similar
products based on their short description text? What else could I use?
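
For reference, here is what running plain k-means over a range of k looks
like, comparing within-cluster SSE (the "elbow" heuristic). This is a
self-contained numpy sketch with made-up data, not Mahout code, and the
function names are my own.

```python
import numpy as np

def kmeans(points, k, n_iters=20, seed=0):
    """Plain Lloyd k-means; k must be chosen up front."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign every point to its nearest centroid.
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if empty).
        centroids = np.array([points[labels == j].mean(axis=0)
                              if np.any(labels == j) else centroids[j]
                              for j in range(k)])
    return labels, centroids

def sse(points, labels, centroids):
    """Within-cluster sum of squared errors: lower means tighter clusters."""
    return float(sum(((points[labels == j] - c) ** 2).sum()
                     for j, c in enumerate(centroids)))

# Three separated 2-D blobs; the SSE curve should flatten sharply
# (the "elbow") once k reaches the true cluster count of 3.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(m, 0.1, (30, 2)) for m in (0.0, 4.0, 8.0)])
results = {}
for k in (1, 2, 3, 4, 5):
    labels, cents = kmeans(pts, k)
    results[k] = sse(pts, labels, cents)
    print(k, round(results[k], 2))
```

At 50.000 short documents this kind of sweep over candidate k values is
cheap enough to run many times.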




2013/10/1 Ted Dunning <te...@gmail.com>

> At such small sizes, I would guess that the sequential version of the
> streaming k-means or ball k-means would be better options.

Re: What are the best settings for my clustering task

Posted by Ted Dunning <te...@gmail.com>.
At such small sizes, I would guess that the sequential version of the
streaming k-means or ball k-means would be better options.



On Mon, Sep 30, 2013 at 2:14 PM, mercutio7979 <jb...@googlemail.com>wrote:

> Hello all,
>
> I am currently trying to create clusters from a group of 50.000 strings that
> contain product descriptions (around 70-100 characters each).
>
> That group of 50.000 consists of roughly 5.000 individual products and ten
> varying product descriptions per product. The product descriptions are
> already prepared for clustering and contain a normalized brand name,
> product
> model number, etc.
>
> What would be a good approach to maximise the number of clusters found (the
> best possible value would be 5.000 clusters with 10 products each)?
>
> I adapted the reuters cluster script to read in my data and managed to
> create a first set of clusters. However, I have not managed to maximise the
> cluster count.
>
> The question is: what do I need to tweak with regard to the available
> mahout
> settings, so the clusters are created as precisely as possible?
>
> Many regards!
> Jens
>
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/What-are-the-best-settings-for-my-clustering-task-tp4092807.html
> Sent from the Mahout User List mailing list archive at Nabble.com.
>