Posted to user@mahout.apache.org by Jure Jeseničnik <Ju...@planet9.si> on 2010/11/17 12:54:11 UTC

Canopy memory consumption

Hi Guys.

What I'm trying to do is basic news clustering that will group news articles about the same topic into clusters. I have the data in a database, so I took the following approach:

1.       Wrote a small program that puts the data from the db into a Lucene Index.

2.       Created vectors from index with the following command:
mahout lucene.vector -d newsindex -f text -o input/out.txt -t dict.txt -i link -n 2

3.       Ran canopy, to get initial clusters:
mahout canopy -i input/ -o output-canopy/ -t1 1 -t2 1.4 -ow

4.       Ran the kmeans to perform the final clustering:
mahout kmeans -i input/ -o output-kmeans/ -c output-canopy/clusters-0 -x 10 -cl -ow

5.       Do the clusterdump to view results:
mahout clusterdump -s output-kmeans/clusters-2 -d dict.txt -p output-kmeans/clusteredPoints -dt text -b 100 -n 10 > result.txt
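
In case it matters, step 1 is nothing special. A minimal sketch of the indexer, assuming Lucene 3.x and with made-up field values, looks roughly like this; "text" is indexed with term vectors so that lucene.vector can read it (-f text), and "link" is the un-analyzed id field used with -i link:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class NewsIndexer {
  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter(
        FSDirectory.open(new File("newsindex")),
        new StandardAnalyzer(Version.LUCENE_30),
        true, IndexWriter.MaxFieldLength.UNLIMITED);
    // One Document per database row (the JDBC loop is omitted here).
    Document doc = new Document();
    // "link" is stored un-analyzed so it can serve as the vector id.
    doc.add(new Field("link", "http://example.com/news/1",
        Field.Store.YES, Field.Index.NOT_ANALYZED));
    // "text" carries the article body; term vectors are needed by lucene.vector.
    doc.add(new Field("text", "article body goes here",
        Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
    writer.addDocument(doc);
    writer.close();
  }
}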

When I run this with circa 1000 records (8000 distinct terms), the results are just perfect. I get exactly the clusters I want. The problems start when I try the same steps with a bit more data.

With 6000 records (28000 terms), or even half that, the process fails at the canopy step with a Java heap space OutOfMemoryError. The MAHOUT_HEAPSIZE variable on my local machine is set to 1024. I even tried running it on our development Hadoop cluster with approximately the same amount of memory, but it failed with the same error.
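
For what it's worth, the only tuning I'm doing is setting that variable in the environment before calling the driver; as far as I can tell, the bin/mahout script turns MAHOUT_HEAPSIZE into the JVM -Xmx setting:

export MAHOUT_HEAPSIZE=1024
mahout canopy -i input/ -o output-canopy/ -t1 1 -t2 1.4 -ow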

I realize that software needs a certain amount of memory to work properly, but I find it hard to believe that 1 GB is not enough to process a 3.1 MB file, which is the size of the vectors file produced by the second step. We're hoping to use this solution on hundreds of thousands of records, and I can't help but wonder what sort of hardware we'll need to process them if this kind of memory consumption is normal.

Am I missing something here? Are there any other settings I should be taking into consideration?

And one more thing: I tried the mean shift implementation and it seems to work fine with that much data.

Thanks.

Jure


RE: Canopy memory consumption

Posted by Jure Jeseničnik <Ju...@planet9.si>.
It works now.

Thank you very much for this, Jeff.

Best regards.

Jure

-----Original Message-----
From: Jeff Eastman [mailto:jeastman@Narus.com] 
Sent: Friday, November 19, 2010 8:41 PM
To: user@mahout.apache.org
Subject: RE: Canopy memory consumption

Ok, I can duplicate this and here is the problem:

1. The given T1 and T2 values create a lot of clusters with only 1 point in them
2. AbstractCluster.computeParameters() special-cases single-point cluster radius calculation 
	"else {radius = radius.assign(Double.MIN_NORMAL);}"
3. This has the effect of setting *every element in the radius vector* to MIN_NORMAL. But your vectors are large and sparse so the memory consumption skyrockets, causing the OME.
4. I changed the radius calculation by removing the entire else clause and your data runs fine now. The result is that the radius will be exactly zero instead of MIN_NORMAL. IIRC, this was an interim solution to divide by zero errors during pdf() calculations but we later fixed those by adding a very small prior to the radius. All of the unit tests continue to run so I will commit the change this weekend.

Thanks for finding this corner case
Jeff

-----Original Message-----
From: Jeff Eastman [mailto:jeastman@narus.com] 
Sent: Friday, November 19, 2010 9:05 AM
To: user@mahout.apache.org
Subject: RE: Canopy memory consumption

Hi Jure,

Thanks for the data. I will run this over the weekend and get back to you.

In both Canopy and Mean Shift, the T2 parameter is critical for determining the number of clusters after the first pass. In Canopy, any input vector that is within T2 distance from an existing Canopy will not generate a new Canopy. In Mean Shift, a MeanShiftCanopy that is within T2 distance from an existing MSCanopy will be merged with it. The T1 parameters influence which points are considered in calculating the new centroid for the cluster.

-----Original Message-----
From: Jure Jeseničnik [mailto:Jure.Jesenicnik@planet9.si] 
Sent: Friday, November 19, 2010 1:35 AM
To: user@mahout.apache.org
Subject: RE: Canopy memory consumption

Here's the folder that I am using as an input: http://dl.dropbox.com/u/9352657/input.zip. 
The results that I'm looking for should contain somewhere around 5000 clusters. It might sound unusual, but that's just the nature of our problem.
We got the best results with the mean shift (T1=1.0, T2=1.35). The results of this clustering were checked "by hand", and it was confirmed that this is what we are looking for (5013 clusters, with some minor anomalies). Canopy failed with these values.

I would still like to do this the proper way, with T1>T2, but I am having trouble finding input distances that give a good result. I'm currently working on this, but I would still appreciate any help you could give me in determining the proper distances. Trial and error feels like looking for a needle in a haystack.
T1=1.0, T2=0.6 gave me some results with the mean shift, but Canopy kept failing with these values as well. I also tried the sequential approach, but it failed due to lack of memory too; it just took much, much longer.

As you mentioned yourself, T1>T2 should probably be enforced, and I would not like to rely on a solution that is based on a missing "sanity" check. Who knows what the future will bring.

Thanks.

Jure




-----Original Message-----
From: Jeff Eastman [mailto:jeastman@Narus.com] 
Sent: Thursday, November 18, 2010 5:38 PM
To: user@mahout.apache.org
Subject: RE: Canopy memory consumption

900 clusters from 1000 vectors seems unusual. I'd be looking for a clustering that produced maybe 5-10% of that. Looking over your parameters, I notice your T1 value is less than T2. This violates the T1>T2 expectation for both Canopy and Mean Shift which is, apparently, not enforced. It probably should be and this might be the source of your problems but I'm not sure how this could cause a premature OME.

In terms of using Mean Shift, I'd say the proof of the pudding is in the eating. If it gives you reasonable results and can handle your data, then it's all good. Canopy/k-Means is more of a mainstream approach and *should* scale better. I'd be interested in seeing a stack trace of where Canopy is bombing on you. A gig of memory should be more than enough to run your 3.1 MB file using the sequential (-xm sequential) execution method, never mind using mapreduce!

Any chance you could share your input vectors file?

-----Original Message-----
From: Jure Jeseničnik [mailto:Jure.Jesenicnik@planet9.si] 
Sent: Wednesday, November 17, 2010 11:02 PM
To: user@mahout.apache.org
Subject: RE: Canopy memory consumption

Hi Jeff

Thank you for your answer. On a smaller scale I got around 10% fewer clusters than records (900 clusters from 1000 records). This corresponds with the actual data that I fed to Canopy, and I even checked the results manually; it was almost exactly what I wanted. A bit more fiddling with T1 and T2 and it would have been spot on.
When I run the mean shift with the same T1 and T2, it processes the 6000 records with ease. In the cases where I was able to get Canopy + k-means through, the results seemed pretty similar to those that the mean shift gave me.

Could mean shift be the path I'm looking for, or is there a possibility of running into problems later?

Regards,

Jure


-----Original Message-----
From: Jeff Eastman [mailto:jeastman@Narus.com] 
Sent: Thursday, November 18, 2010 1:02 AM
To: user@mahout.apache.org
Subject: RE: Canopy memory consumption

Canopy is a bit fussy about its T1 and T2 parameters: If you set T2 too small, you will get one cluster for each input vector; too large and you will get only one cluster for all vectors. T1 is less sensitive and will only impact how many points near each cluster are included in its centroid calculation.  My guess is you are in the first situation with T2 too small and, with the larger dataset, are creating more clusters than will fit into your memory.

How many clusters did you get from your small dataset? If the small set is a subset of the large set you could always run Canopy over the small set to get your k-means initial cluster centers, then run k-means iterations over the full dataset after. You can also skip the Canopy step entirely when using k-means: include a -k parameter and k-means will sample that many initial cluster centers from your data and then run its iterations. 

Glad to hear MeanShift is working for you. It has similar scaling limitations to Canopy. I've been pleasantly surprised by its performance on problems I thought were out of scope for it. Don't know why it works on your larger dataset when Canopy fails though.

-----Original Message-----
From: Jure Jeseničnik [mailto:Jure.Jesenicnik@planet9.si] 
Sent: Wednesday, November 17, 2010 3:54 AM
To: user@mahout.apache.org
Subject: Canopy memory consumption

Hi Guys.

What I'm trying to do is basic news clustering that will group news articles about the same topic into clusters. I have the data in a database, so I took the following approach:

1.       Wrote a small program that puts the data from the db into a Lucene Index.

2.       Created vectors from index with the following command:
mahout lucene.vector -d newsindex -f text -o input/out.txt -t dict.txt -i link -n 2

3.       Ran canopy, to get initial clusters:
mahout canopy -i input/ -o output-canopy/ -t1 1 -t2 1.4 -ow

4.       Ran the kmeans to perform the final clustering:
mahout kmeans -i input/ -o output-kmeans/ -c output-canopy/clusters-0 -x 10 -cl -ow

5.       Do the clusterdump to view results:
mahout clusterdump -s output-kmeans/clusters-2 -d dict.txt -p output-kmeans/clusteredPoints -dt text -b 100 -n 10 > result.txt

When I run this with circa 1000 records (8000 distinct terms), the results are just perfect. I get exactly the clusters I want. The problems start when I try the same steps with a bit more data.

With 6000 records (28000 terms), or even half that, the process fails at the canopy step with a Java heap space OutOfMemoryError. The MAHOUT_HEAPSIZE variable on my local machine is set to 1024. I even tried running it on our development Hadoop cluster with approximately the same amount of memory, but it failed with the same error.

I realize that software needs a certain amount of memory to work properly, but I find it hard to believe that 1 GB is not enough to process a 3.1 MB file, which is the size of the vectors file produced by the second step. We're hoping to use this solution on hundreds of thousands of records, and I can't help but wonder what sort of hardware we'll need to process them if this kind of memory consumption is normal.

Am I missing something here? Are there any other settings I should be taking into consideration?

And one more thing: I tried the mean shift implementation and it seems to work fine with that much data.

Thanks.

Jure


RE: Canopy memory consumption

Posted by Jeff Eastman <je...@Narus.com>.
Ok, I can duplicate this and here is the problem:

1. The given T1 and T2 values create a lot of clusters with only 1 point in them
2. AbstractCluster.computeParameters() special-cases single-point cluster radius calculation 
	"else {radius = radius.assign(Double.MIN_NORMAL);}"
3. This has the effect of setting *every element in the radius vector* to MIN_NORMAL. But your vectors are large and sparse so the memory consumption skyrockets, causing the OME.
4. I changed the radius calculation by removing the entire else clause and your data runs fine now. The result is that the radius will be exactly zero instead of MIN_NORMAL. IIRC, this was an interim solution to divide by zero errors during pdf() calculations but we later fixed those by adding a very small prior to the radius. All of the unit tests continue to run so I will commit the change this weekend.
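
To see why that single line is so costly, here is a rough, self-contained illustration against the mahout-math Vector API. This is only a sketch of the effect, not the actual computeParameters() code; the cardinality is borrowed from your 28000-term case:

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class RadiusBlowup {
  public static void main(String[] args) {
    // A typical news vector: large cardinality, only a handful of non-zero terms.
    Vector radius = new RandomAccessSparseVector(28000);
    radius.set(17, 0.3);
    System.out.println(radius.getNumNondefaultElements()); // 1 stored entry

    // What the old single-point special case did:
    radius = radius.assign(Double.MIN_NORMAL);

    // assign(double) touches every cell, so the sparse vector is now effectively
    // dense: one stored entry per cell, for every singleton canopy.
    System.out.println(radius.getNumNondefaultElements()); // 28000 stored entries
  }
}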

Thanks for finding this corner case
Jeff

-----Original Message-----
From: Jeff Eastman [mailto:jeastman@narus.com] 
Sent: Friday, November 19, 2010 9:05 AM
To: user@mahout.apache.org
Subject: RE: Canopy memory consumption

Hi Jure,

Thanks for the data. I will run this over the weekend and get back to you.

In both Canopy and Mean Shift, the T2 parameter is critical for determining the number of clusters after the first pass. In Canopy, any input vector that is within T2 distance from an existing Canopy will not generate a new Canopy. In Mean Shift, a MeanShiftCanopy that is within T2 distance from an existing MSCanopy will be merged with it. The T1 parameters influence which points are considered in calculating the new centroid for the cluster.

-----Original Message-----
From: Jure Jeseničnik [mailto:Jure.Jesenicnik@planet9.si] 
Sent: Friday, November 19, 2010 1:35 AM
To: user@mahout.apache.org
Subject: RE: Canopy memory consumption

Here's the folder that I am using as an input: http://dl.dropbox.com/u/9352657/input.zip. 
The results that I'm looking for should contain somewhere around 5000 clusters. It might sound unusual, but that's just the nature of our problem.
We got the best results with the mean shift (T1=1.0, T2=1.35). The results of this clustering were checked "by hand", and it was confirmed that this is what we are looking for (5013 clusters, with some minor anomalies). Canopy failed with these values.

I would still like to do this the proper way, with T1>T2, but I am having trouble finding input distances that give a good result. I'm currently working on this, but I would still appreciate any help you could give me in determining the proper distances. Trial and error feels like looking for a needle in a haystack.
T1=1.0, T2=0.6 gave me some results with the mean shift, but Canopy kept failing with these values as well. I also tried the sequential approach, but it failed due to lack of memory too; it just took much, much longer.

As you mentioned yourself, T1>T2 should probably be enforced, and I would not like to rely on a solution that is based on a missing "sanity" check. Who knows what the future will bring.

Thanks.

Jure




-----Original Message-----
From: Jeff Eastman [mailto:jeastman@Narus.com] 
Sent: Thursday, November 18, 2010 5:38 PM
To: user@mahout.apache.org
Subject: RE: Canopy memory consumption

900 clusters from 1000 vectors seems unusual. I'd be looking for a clustering that produced maybe 5-10% of that. Looking over your parameters, I notice your T1 value is less than T2. This violates the T1>T2 expectation for both Canopy and Mean Shift which is, apparently, not enforced. It probably should be and this might be the source of your problems but I'm not sure how this could cause a premature OME.

In terms of using Mean Shift, I'd say the proof of the pudding is in the eating. If it gives you reasonable results and can handle your data, then it's all good. Canopy/k-Means is more of a mainstream approach and *should* scale better. I'd be interested in seeing a stack trace of where Canopy is bombing on you. A gig of memory should be more than enough to run your 3.1 MB file using the sequential (-xm sequential) execution method, never mind using mapreduce!

Any chance you could share your input vectors file?

-----Original Message-----
From: Jure Jeseničnik [mailto:Jure.Jesenicnik@planet9.si] 
Sent: Wednesday, November 17, 2010 11:02 PM
To: user@mahout.apache.org
Subject: RE: Canopy memory consumption

Hi Jeff

Thank you for your answer. On a smaller scale I got around 10% fewer clusters than records (900 clusters from 1000 records). This corresponds with the actual data that I fed to Canopy, and I even checked the results manually; it was almost exactly what I wanted. A bit more fiddling with T1 and T2 and it would have been spot on.
When I run the mean shift with the same T1 and T2, it processes the 6000 records with ease. In the cases where I was able to get Canopy + k-means through, the results seemed pretty similar to those that the mean shift gave me.

Could mean shift be the path I'm looking for, or is there a possibility of running into problems later?

Regards,

Jure


-----Original Message-----
From: Jeff Eastman [mailto:jeastman@Narus.com] 
Sent: Thursday, November 18, 2010 1:02 AM
To: user@mahout.apache.org
Subject: RE: Canopy memory consumption

Canopy is a bit fussy about its T1 and T2 parameters: If you set T2 too small, you will get one cluster for each input vector; too large and you will get only one cluster for all vectors. T1 is less sensitive and will only impact how many points near each cluster are included in its centroid calculation.  My guess is you are in the first situation with T2 too small and, with the larger dataset, are creating more clusters than will fit into your memory.

How many clusters did you get from your small dataset? If the small set is a subset of the large set you could always run Canopy over the small set to get your k-means initial cluster centers, then run k-means iterations over the full dataset after. You can also skip the Canopy step entirely when using k-means: include a -k parameter and k-means will sample that many initial cluster centers from your data and then run its iterations. 

Glad to hear MeanShift is working for you. It has similar scaling limitations to Canopy. I've been pleasantly surprised by its performance on problems I thought were out of scope for it. Don't know why it works on your larger dataset when Canopy fails though.

-----Original Message-----
From: Jure Jeseničnik [mailto:Jure.Jesenicnik@planet9.si] 
Sent: Wednesday, November 17, 2010 3:54 AM
To: user@mahout.apache.org
Subject: Canopy memory consumption

Hi Guys.

What I'm trying to do is basic news clustering that will group news articles about the same topic into clusters. I have the data in a database, so I took the following approach:

1.       Wrote a small program that puts the data from the db into a Lucene Index.

2.       Created vectors from index with the following command:
mahout lucene.vector -d newsindex -f text -o input/out.txt -t dict.txt -i link -n 2

3.       Ran canopy, to get initial clusters:
mahout canopy -i input/ -o output-canopy/ -t1 1 -t2 1.4 -ow

4.       Ran the kmeans to perform the final clustering:
mahout kmeans -i input/ -o output-kmeans/ -c output-canopy/clusters-0 -x 10 -cl -ow

5.       Do the clusterdump to view results:
mahout clusterdump -s output-kmeans/clusters-2 -d dict.txt -p output-kmeans/clusteredPoints -dt text -b 100 -n 10 > result.txt

When I run this with circa 1000 records (8000 distinct terms), the results are just perfect. I get exactly the clusters I want. The problems start when I try the same steps with a bit more data.

With 6000 records (28000 terms), or even half that, the process fails at the canopy step with a Java heap space OutOfMemoryError. The MAHOUT_HEAPSIZE variable on my local machine is set to 1024. I even tried running it on our development Hadoop cluster with approximately the same amount of memory, but it failed with the same error.

I realize that software needs a certain amount of memory to work properly, but I find it hard to believe that 1 GB is not enough to process a 3.1 MB file, which is the size of the vectors file produced by the second step. We're hoping to use this solution on hundreds of thousands of records, and I can't help but wonder what sort of hardware we'll need to process them if this kind of memory consumption is normal.

Am I missing something here? Are there any other settings I should be taking into consideration?

And one more thing: I tried the mean shift implementation and it seems to work fine with that much data.

Thanks.

Jure


RE: Canopy memory consumption

Posted by Jeff Eastman <je...@Narus.com>.
Hi Jure,

Thanks for the data. I will run this over the weekend and get back to you.

In both Canopy and Mean Shift, the T2 parameter is critical for determining the number of clusters after the first pass. In Canopy, any input vector that is within T2 distance from an existing Canopy will not generate a new Canopy. In Mean Shift, a MeanShiftCanopy that is within T2 distance from an existing MSCanopy will be merged with it. The T1 parameters influence which points are considered in calculating the new centroid for the cluster.
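
If it helps to see the two thresholds in one place, the first pass boils down to roughly the following. This is a deliberately simplified, sequential sketch of the idea using plain double[] points, not the actual CanopyClusterer code:

import java.util.ArrayList;
import java.util.List;

public class CanopySketch {

  static class SimpleCanopy {
    final double[] center;
    final List<double[]> boundPoints = new ArrayList<double[]>();
    SimpleCanopy(double[] p) { center = p.clone(); boundPoints.add(p); }
  }

  static double distance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  static List<SimpleCanopy> cluster(List<double[]> points, double t1, double t2) {
    List<SimpleCanopy> canopies = new ArrayList<SimpleCanopy>();
    for (double[] p : points) {
      boolean withinT2OfSomeCanopy = false;
      for (SimpleCanopy c : canopies) {
        double d = distance(p, c.center);
        if (d < t1) {
          c.boundPoints.add(p);          // T1: point contributes to this canopy's centroid
        }
        if (d < t2) {
          withinT2OfSomeCanopy = true;   // T2: point is covered, so it creates no new canopy
        }
      }
      if (!withinT2OfSomeCanopy) {
        canopies.add(new SimpleCanopy(p));
      }
    }
    return canopies;
  }
}

With a tiny T2, almost every point falls outside all existing canopies and the canopy list grows toward one entry per input vector, which is where the memory goes.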

-----Original Message-----
From: Jure Jeseničnik [mailto:Jure.Jesenicnik@planet9.si] 
Sent: Friday, November 19, 2010 1:35 AM
To: user@mahout.apache.org
Subject: RE: Canopy memory consumption

Here's the folder that I am using as an input: http://dl.dropbox.com/u/9352657/input.zip. 
The results that I'm looking for should contain somewhere around 5000 clusters. It might sound unusual, but that's just the nature of our problem.
We got the best results with the mean shift (T1=1.0, T2=1.35). The results of this clustering were checked "by hand", and it was confirmed that this is what we are looking for (5013 clusters, with some minor anomalies). Canopy failed with these values.

I would still like to do this the proper way, with T1>T2, but I am having trouble finding input distances that give a good result. I'm currently working on this, but I would still appreciate any help you could give me in determining the proper distances. Trial and error feels like looking for a needle in a haystack.
T1=1.0, T2=0.6 gave me some results with the mean shift, but Canopy kept failing with these values as well. I also tried the sequential approach, but it failed due to lack of memory too; it just took much, much longer.

As you mentioned yourself, T1>T2 should probably be enforced, and I would not like to rely on a solution that is based on a missing "sanity" check. Who knows what the future will bring.

Thanks.

Jure




-----Original Message-----
From: Jeff Eastman [mailto:jeastman@Narus.com] 
Sent: Thursday, November 18, 2010 5:38 PM
To: user@mahout.apache.org
Subject: RE: Canopy memory consumption

900 clusters from 1000 vectors seems unusual. I'd be looking for a clustering that produced maybe 5-10% of that. Looking over your parameters, I notice your T1 value is less than T2. This violates the T1>T2 expectation for both Canopy and Mean Shift which is, apparently, not enforced. It probably should be and this might be the source of your problems but I'm not sure how this could cause a premature OME.

In terms of using Mean Shift, I'd say the proof of the pudding is in the eating. If it gives you reasonable results and can handle your data, then it's all good. Canopy/k-Means is more of a mainstream approach and *should* scale better. I'd be interested in seeing a stack trace of where Canopy is bombing on you. A gig of memory should be more than enough to run your 3.1 MB file using the sequential (-xm sequential) execution method, never mind using mapreduce!

Any chance you could share your input vectors file?

-----Original Message-----
From: Jure Jeseničnik [mailto:Jure.Jesenicnik@planet9.si] 
Sent: Wednesday, November 17, 2010 11:02 PM
To: user@mahout.apache.org
Subject: RE: Canopy memory consumption

Hi Jeff

Thank you for your answer. On a smaller scale I got around 10% fewer clusters than records (900 clusters from 1000 records). This corresponds with the actual data that I fed to Canopy, and I even checked the results manually; it was almost exactly what I wanted. A bit more fiddling with T1 and T2 and it would have been spot on.
When I run the mean shift with the same T1 and T2, it processes the 6000 records with ease. In the cases where I was able to get Canopy + k-means through, the results seemed pretty similar to those that the mean shift gave me.

Could mean shift be the path I'm looking for, or is there a possibility of running into problems later?

Regards,

Jure


-----Original Message-----
From: Jeff Eastman [mailto:jeastman@Narus.com] 
Sent: Thursday, November 18, 2010 1:02 AM
To: user@mahout.apache.org
Subject: RE: Canopy memory consumption

Canopy is a bit fussy about its T1 and T2 parameters: If you set T2 too small, you will get one cluster for each input vector; too large and you will get only one cluster for all vectors. T1 is less sensitive and will only impact how many points near each cluster are included in its centroid calculation.  My guess is you are in the first situation with T2 too small and, with the larger dataset, are creating more clusters than will fit into your memory.

How many clusters did you get from your small dataset? If the small set is a subset of the large set you could always run Canopy over the small set to get your k-means initial cluster centers, then run k-means iterations over the full dataset after. You can also skip the Canopy step entirely when using k-means: include a -k parameter and k-means will sample that many initial cluster centers from your data and then run its iterations. 

Glad to hear MeanShift is working for you. It has similar scaling limitations to Canopy. I've been pleasantly surprised by its performance on problems I thought were out of scope for it. Don't know why it works on your larger dataset when Canopy fails though.

-----Original Message-----
From: Jure Jeseničnik [mailto:Jure.Jesenicnik@planet9.si] 
Sent: Wednesday, November 17, 2010 3:54 AM
To: user@mahout.apache.org
Subject: Canopy memory consumption

Hi Guys.

What I'm trying to do is basic news clustering that will group news articles about the same topic into clusters. I have the data in a database, so I took the following approach:

1.       Wrote a small program that puts the data from the db into a Lucene Index.

2.       Created vectors from index with the following command:
mahout lucene.vector -d newsindex -f text -o input/out.txt -t dict.txt -i link -n 2

3.       Ran canopy, to get initial clusters:
mahout canopy -i input/ -o output-canopy/ -t1 1 -t2 1.4 -ow

4.       Ran the kmeans to perform the final clustering:
mahout kmeans -i input/ -o output-kmeans/ -c output-canopy/clusters-0 -x 10 -cl -ow

5.       Do the clusterdump to view results:
mahout clusterdump -s output-kmeans/clusters-2 -d dict.txt -p output-kmeans/clusteredPoints -dt text -b 100 -n 10 > result.txt

When I run this with circa 1000 records (8000 distinct terms), the results are just perfect. I get exactly the clusters I want. The problems start when I try the same steps with a bit more data.

With 6000 records (28000 terms), or even half that, the process fails at the canopy step with a Java heap space OutOfMemoryError. The MAHOUT_HEAPSIZE variable on my local machine is set to 1024. I even tried running it on our development Hadoop cluster with approximately the same amount of memory, but it failed with the same error.

I realize that software needs a certain amount of memory to work properly, but I find it hard to believe that 1 GB is not enough to process a 3.1 MB file, which is the size of the vectors file produced by the second step. We're hoping to use this solution on hundreds of thousands of records, and I can't help but wonder what sort of hardware we'll need to process them if this kind of memory consumption is normal.

Am I missing something here? Are there any other settings I should be taking into consideration?

And one more thing: I tried the mean shift implementation and it seems to work fine with that much data.

Thanks.

Jure


RE: Canopy memory consumption

Posted by Jure Jeseničnik <Ju...@planet9.si>.
Here's the folder that I am using as an input: http://dl.dropbox.com/u/9352657/input.zip. 
The results that I'm looking for should contain somewhere around 5000 clusters. It might sound unusual, but that's just the nature of our problem.
We got the best results with the mean shift (T1=1.0, T2=1.35). The results of this clustering were checked "by hand", and it was confirmed that this is what we are looking for (5013 clusters, with some minor anomalies). Canopy failed with these values.

I would still like to do this the proper way, with T1>T2, but I am having trouble finding input distances that give a good result. I'm currently working on this, but I would still appreciate any help you could give me in determining the proper distances. Trial and error feels like looking for a needle in a haystack.
T1=1.0, T2=0.6 gave me some results with the mean shift, but Canopy kept failing with these values as well. I also tried the sequential approach, but it failed due to lack of memory too; it just took much, much longer.

As you mentioned yourself, T1>T2 should probably be enforced, and I would not like to rely on a solution that is based on a missing "sanity" check. Who knows what the future will bring.

Thanks.

Jure




-----Original Message-----
From: Jeff Eastman [mailto:jeastman@Narus.com] 
Sent: Thursday, November 18, 2010 5:38 PM
To: user@mahout.apache.org
Subject: RE: Canopy memory consumption

900 clusters from 1000 vectors seems unusual. I'd be looking for a clustering that produced maybe 5-10% of that. Looking over your parameters, I notice your T1 value is less than T2. This violates the T1>T2 expectation for both Canopy and Mean Shift which is, apparently, not enforced. It probably should be and this might be the source of your problems but I'm not sure how this could cause a premature OME.

In terms of using Mean Shift, I'd say the proof of the pudding is in the eating. If it gives you reasonable results and can handle your data, then it's all good. Canopy/k-Means is more of a mainstream approach and *should* scale better. I'd be interested in seeing a stack trace of where Canopy is bombing on you. A gig of memory should be more than enough to run your 3.1 MB file using the sequential (-xm sequential) execution method, never mind using mapreduce!

Any chance you could share your input vectors file?

-----Original Message-----
From: Jure Jeseničnik [mailto:Jure.Jesenicnik@planet9.si] 
Sent: Wednesday, November 17, 2010 11:02 PM
To: user@mahout.apache.org
Subject: RE: Canopy memory consumption

Hi Jeff

Thank you for your answer. On a smaller scale I got around 10% fewer clusters than records (900 clusters from 1000 records). This corresponds with the actual data that I fed to Canopy, and I even checked the results manually; it was almost exactly what I wanted. A bit more fiddling with T1 and T2 and it would have been spot on.
When I run the mean shift with the same T1 and T2, it processes the 6000 records with ease. In the cases where I was able to get Canopy + k-means through, the results seemed pretty similar to those that the mean shift gave me.

Could mean shift be the path I'm looking for, or is there a possibility of running into problems later?

Regards,

Jure


-----Original Message-----
From: Jeff Eastman [mailto:jeastman@Narus.com] 
Sent: Thursday, November 18, 2010 1:02 AM
To: user@mahout.apache.org
Subject: RE: Canopy memory consumption

Canopy is a bit fussy about its T1 and T2 parameters: If you set T2 too small, you will get one cluster for each input vector; too large and you will get only one cluster for all vectors. T1 is less sensitive and will only impact how many points near each cluster are included in its centroid calculation.  My guess is you are in the first situation with T2 too small and, with the larger dataset, are creating more clusters than will fit into your memory.

How many clusters did you get from your small dataset? If the small set is a subset of the large set you could always run Canopy over the small set to get your k-means initial cluster centers, then run k-means iterations over the full dataset after. You can also skip the Canopy step entirely when using k-means: include a -k parameter and k-means will sample that many initial cluster centers from your data and then run its iterations. 

Glad to hear MeanShift is working for you. It has similar scaling limitations to Canopy. I've been pleasantly surprised by its performance on problems I thought were out of scope for it. Don't know why it works on your larger dataset when Canopy fails though.

-----Original Message-----
From: Jure Jeseničnik [mailto:Jure.Jesenicnik@planet9.si] 
Sent: Wednesday, November 17, 2010 3:54 AM
To: user@mahout.apache.org
Subject: Canopy memory consumption

Hi Guys.

What I'm trying to do is basic news clustering that will group news articles about the same topic into clusters. I have the data in a database, so I took the following approach:

1.       Wrote a small program that puts the data from the db into a Lucene Index.

2.       Created vectors from index with the following command:
mahout lucene.vector -d newsindex -f text -o input/out.txt -t dict.txt -i link -n 2

3.       Ran canopy, to get initial clusters:
mahout canopy -i input/ -o output-canopy/ -t1 1 -t2 1.4 -ow

4.       Ran the kmeans to perform the final clustering:
mahout kmeans -i input/ -o output-kmeans/ -c output-canopy/clusters-0 -x 10 -cl -ow

5.       Do the clusterdump to view results:
mahout clusterdump -s output-kmeans/clusters-2 -d dict.txt -p output-kmeans/clusteredPoints -dt text -b 100 -n 10 > result.txt

When I run this with circa 1000 records (8000 distinct terms), the results are just perfect. I get exactly the clusters I want. The problems start when I try the same steps with a bit more data.

With 6000 records (28000 terms), or even half that, the process fails at the canopy step with a Java heap space OutOfMemoryError. The MAHOUT_HEAPSIZE variable on my local machine is set to 1024. I even tried running it on our development Hadoop cluster with approximately the same amount of memory, but it failed with the same error.

I realize that software needs a certain amount of memory to work properly, but I find it hard to believe that 1 GB is not enough to process a 3.1 MB file, which is the size of the vectors file produced by the second step. We're hoping to use this solution on hundreds of thousands of records, and I can't help but wonder what sort of hardware we'll need to process them if this kind of memory consumption is normal.

Am I missing something here? Are there any other settings I should be taking into consideration?

And one more thing: I tried the mean shift implementation and it seems to work fine with that much data.

Thanks.

Jure


Re: Canopy memory consumption

Posted by Ted Dunning <te...@gmail.com>.
If k-means is trying to maintain too many clusters, then it will use way
more memory and run much more slowly.

That alone could be the genesis of the problem.

2010/11/18 Jeff Eastman <je...@narus.com>

> 900 clusters from 1000 vectors seems unusual. I'd be looking for a
> clustering that produced maybe 5-10% of that. Looking over your parameters,
> I notice your T1 value is less than T2. This violates the T1>T2 expectation
> for both Canopy and Mean Shift which is, apparently, not enforced. It
> probably should be and this might be the source of your problems but I'm not
> sure how this could cause a premature OME.
>
> In terms of using Mean Shift, I'd say the proof of the pudding is in the
> eating. If it gives you reasonable results and can handle your data, then
> it's all good. Canopy/k-Means is more of a mainstream approach and *should*
> scale better. I'd be interested in seeing a stack trace of where Canopy is
> bombing on you. A gig of memory should be more than enough to run your 3.1
> MB file using the sequential (-xm sequential) execution method, never mind
> using mapreduce!
>
> Any chance you could share your input vectors file?
>
> -----Original Message-----
> From: Jure Jeseničnik [mailto:Jure.Jesenicnik@planet9.si]
> Sent: Wednesday, November 17, 2010 11:02 PM
> To: user@mahout.apache.org
> Subject: RE: Canopy memory consumption
>
> Hi Jeff
>
> Thank you for your answer. On a smaller scale I got around 10% fewer
> clusters than records (900 clusters from 1000 records). This corresponds
> with the actual data that I fed to Canopy, and I even checked the results
> manually; it was almost exactly what I wanted. A bit more fiddling with T1
> and T2 and it would have been spot on.
> When I run the mean shift with the same T1 and T2, it processes the 6000
> records with ease. In the cases where I was able to get Canopy + k-means
> through, the results seemed pretty similar to those that the mean shift
> gave me.
>
> Could mean shift be the path I'm looking for, or is there a possibility of
> running into problems later?
>
> Regards,
>
> Jure
>
>
> -----Original Message-----
> From: Jeff Eastman [mailto:jeastman@Narus.com]
> Sent: Thursday, November 18, 2010 1:02 AM
> To: user@mahout.apache.org
> Subject: RE: Canopy memory consumption
>
> Canopy is a bit fussy about its T1 and T2 parameters: If you set T2 too
> small, you will get one cluster for each input vector; too large and you
> will get only one cluster for all vectors. T1 is less sensitive and will
> only impact how many points near each cluster are included in its centroid
> calculation.  My guess is you are in the first situation with T2 too small
> and, with the larger dataset, are creating more clusters than will fit into
> your memory.
>
> How many clusters did you get from your small dataset? If the small set is
> a subset of the large set you could always run Canopy over the small set to
> get your k-means initial cluster centers, then run k-means iterations over
> the full dataset after. You can also skip the Canopy step entirely when
> using k-means: include a -k parameter and k-means will sample that many
> initial cluster centers from your data and then run its iterations.
>
> Glad to hear MeanShift is working for you. It has similar scaling
> limitations to Canopy. I've been pleasantly surprised by its performance on
> problems I thought were out of scope for it. Don't know why it works on your
> larger dataset when Canopy fails though.
>
> -----Original Message-----
> From: Jure Jeseničnik [mailto:Jure.Jesenicnik@planet9.si]
> Sent: Wednesday, November 17, 2010 3:54 AM
> To: user@mahout.apache.org
> Subject: Canopy memory consumption
>
> Hi Guys.
>
> What I'm trying to do is basic news clustering that will group news
> articles about the same topic into clusters. I have the data in a database,
> so I took the following approach:
>
> 1.       Wrote a small program that puts the data from the db into a Lucene
> Index.
>
> 2.       Created vectors from index with the following command:
> mahout lucene.vector -d newsindex -f text -o input/out.txt -t dict.txt -i
> link -n 2
>
> 3.       Ran canopy, to get initial clusters:
> mahout canopy -i input/ -o output-canopy/ -t1 1 -t2 1.4 -ow
>
> 4.       Ran the kmeans to perform the final clustering:
> mahout kmeans -i input/ -o output-kmeans/ -c output-canopy/clusters-0 -x 10
> -cl -ow
>
> 5.       Do the clusterdump to view results:
> mahout clusterdump -s output-kmeans/clusters-2 -d dict.txt -p
> output-kmeans/clusteredPoints -dt text -b 100 -n 10 > result.txt
>
> When I run this with circa 1000 records (8000 distinct terms), the results
> are just perfect. I get exactly the clusters I want. The problems start when
> I try the same steps with a bit more data.
>
> With 6000 records (28000 terms), or even half that, the process fails at
> the canopy step with a Java heap space OutOfMemoryError. The MAHOUT_HEAPSIZE
> variable on my local machine is set to 1024. I even tried running it on our
> development Hadoop cluster with approximately the same amount of memory, but
> it failed with the same error.
>
> I realize that software needs a certain amount of memory to work properly,
> but I find it hard to believe that 1 GB is not enough to process a 3.1 MB
> file, which is the size of the vectors file produced by the second step.
> We're hoping to use this solution on hundreds of thousands of records, and I
> can't help but wonder what sort of hardware we'll need to process them if
> this kind of memory consumption is normal.
>
> Am I missing something here? Are there any other settings I should be
> taking into consideration?
>
> And one more thing: I tried the mean shift implementation and it seems to
> work fine with that much data.
>
> Thanks.
>
> Jure
>
>

RE: Canopy memory consumption

Posted by Jeff Eastman <je...@Narus.com>.
900 clusters from 1000 vectors seems unusual. I'd be looking for a clustering that produced maybe 5-10% of that. Looking over your parameters, I notice your T1 value is less than T2. This violates the T1>T2 expectation for both Canopy and Mean Shift which is, apparently, not enforced. It probably should be and this might be the source of your problems but I'm not sure how this could cause a premature OME.

In terms of using Mean Shift, I'd say the proof of the pudding is in the eating. If it gives you reasonable results and can handle your data, then it's all good. Canopy/k-Means is more of a mainstream approach and *should* scale better. I'd be interested in seeing a stack trace of where Canopy is bombing on you. A gig of memory should be more than enough to run your 3.1 MB file using the sequential (-xm sequential) execution method, never mind using mapreduce!

Any chance you could share your input vectors file?

-----Original Message-----
From: Jure Jeseničnik [mailto:Jure.Jesenicnik@planet9.si] 
Sent: Wednesday, November 17, 2010 11:02 PM
To: user@mahout.apache.org
Subject: RE: Canopy memory consumption

Hi Jeff

Thank you for your answer. On a smaller scale I got around 10% fewer clusters than records (900 clusters from 1000 records). This corresponds with the actual data that I fed to Canopy, and I even checked the results manually; it was almost exactly what I wanted. A bit more fiddling with T1 and T2 and it would have been spot on.
When I run the mean shift with the same T1 and T2, it processes the 6000 records with ease. In the cases where I was able to get Canopy + k-means through, the results seemed pretty similar to those that the mean shift gave me.

Could mean shift be the path I'm looking for, or is there a possibility of running into problems later?

Regards,

Jure


-----Original Message-----
From: Jeff Eastman [mailto:jeastman@Narus.com] 
Sent: Thursday, November 18, 2010 1:02 AM
To: user@mahout.apache.org
Subject: RE: Canopy memory consumption

Canopy is a bit fussy about its T1 and T2 parameters: If you set T2 too small, you will get one cluster for each input vector; too large and you will get only one cluster for all vectors. T1 is less sensitive and will only impact how many points near each cluster are included in its centroid calculation.  My guess is you are in the first situation with T2 too small and, with the larger dataset, are creating more clusters than will fit into your memory.

How many clusters did you get from your small dataset? If the small set is a subset of the large set you could always run Canopy over the small set to get your k-means initial cluster centers, then run k-means iterations over the full dataset after. You can also skip the Canopy step entirely when using k-means: include a -k parameter and k-means will sample that many initial cluster centers from your data and then run its iterations. 

Glad to hear MeanShift is working for you. It has similar scaling limitations to Canopy. I've been pleasantly surprised by its performance on problems I thought were out of scope for it. Don't know why it works on your larger dataset when Canopy fails though.

-----Original Message-----
From: Jure Jeseničnik [mailto:Jure.Jesenicnik@planet9.si] 
Sent: Wednesday, November 17, 2010 3:54 AM
To: user@mahout.apache.org
Subject: Canopy memory consumption

Hi Guys.

What I'm trying to do is basic news clustering that will group news articles about the same topic into clusters. I have the data in a database, so I took the following approach:

1.       Wrote a small program that puts the data from the db into a Lucene Index.

2.       Created vectors from index with the following command:
mahout lucene.vector -d newsindex -f text -o input/out.txt -t dict.txt -i link -n 2

3.       Ran canopy, to get initial clusters:
mahout canopy -i input/ -o output-canopy/ -t1 1 -t2 1.4 -ow

4.       Ran the kmeans to perform the final clustering:
mahout kmeans -i input/ -o output-kmeans/ -c output-canopy/clusters-0 -x 10 -cl -ow

5.       Do the clusterdump to view results:
mahout clusterdump -s output-kmeans/clusters-2 -d dict.txt -p output-kmeans/clusteredPoints -dt text -b 100 -n 10 > result.txt

When I run this with circa 1000 records (8000 distinct terms), the results are just perfect. I get exactly the clusters I want. The problems start when I try the same steps with a bit more data.

With 6000 records (28000 terms), or even half that, the process fails at the canopy step with a Java heap space OutOfMemoryError. The MAHOUT_HEAPSIZE variable on my local machine is set to 1024. I even tried running it on our development Hadoop cluster with approximately the same amount of memory, but it failed with the same error.

I realize that software needs a certain amount of memory to work properly, but I find it hard to believe that 1 GB is not enough to process a 3.1 MB file, which is the size of the vectors file produced by the second step. We're hoping to use this solution on hundreds of thousands of records, and I can't help but wonder what sort of hardware we'll need to process them if this kind of memory consumption is normal.

Am I missing something here? Are there any other settings I should be taking into consideration?

And one more thing: I tried the mean shift implementation and it seems to work fine with that much data.

Thanks.

Jure


RE: Canopy memory consumption

Posted by Jure Jeseničnik <Ju...@planet9.si>.
Hi Jeff

Thank you for your answer. On a smaller scale I got around 10% fewer clusters than records (900 clusters from 1000 records). This corresponds with the actual data that I fed to Canopy, and I even checked the results manually; it was almost exactly what I wanted. A bit more fiddling with T1 and T2 and it would have been spot on.
When I run the mean shift with the same T1 and T2, it processes the 6000 records with ease. In the cases where I was able to get Canopy + k-means through, the results seemed pretty similar to those that the mean shift gave me.

Could mean shift be the path I'm looking for, or is there a possibility of running into problems later?

Regards,

Jure


-----Original Message-----
From: Jeff Eastman [mailto:jeastman@Narus.com] 
Sent: Thursday, November 18, 2010 1:02 AM
To: user@mahout.apache.org
Subject: RE: Canopy memory consumption

Canopy is a bit fussy about its T1 and T2 parameters: If you set T2 too small, you will get one cluster for each input vector; too large and you will get only one cluster for all vectors. T1 is less sensitive and will only impact how many points near each cluster are included in its centroid calculation.  My guess is you are in the first situation with T2 too small and, with the larger dataset, are creating more clusters than will fit into your memory.

How many clusters did you get from your small dataset? If the small set is a subset of the large set you could always run Canopy over the small set to get your k-means initial cluster centers, then run k-means iterations over the full dataset after. You can also skip the Canopy step entirely when using k-means: include a -k parameter and k-means will sample that many initial cluster centers from your data and then run its iterations. 

Glad to hear MeanShift is working for you. It has similar scaling limitations to Canopy. I've been pleasantly surprised by its performance on problems I thought were out of scope for it. Don't know why it works on your larger dataset when Canopy fails though.

-----Original Message-----
From: Jure Jeseničnik [mailto:Jure.Jesenicnik@planet9.si] 
Sent: Wednesday, November 17, 2010 3:54 AM
To: user@mahout.apache.org
Subject: Canopy memory consumption

Hi Guys.

What I'm trying to do is basic news clustering that will group news articles about the same topic into clusters. I have the data in a database, so I took the following approach:

1.       Wrote a small program that puts the data from the db into a Lucene Index.

2.       Created vectors from index with the following command:
mahout lucene.vector -d newsindex -f text -o input/out.txt -t dict.txt -i link -n 2

3.       Ran canopy, to get initial clusters:
mahout canopy -i input/ -o output-canopy/ -t1 1 -t2 1.4 -ow

4.       Ran the kmeans to perform the final clustering:
mahout kmeans -i input/ -o output-kmeans/ -c output-canopy/clusters-0 -x 10 -cl -ow

5.       Do the clusterdump to view results:
mahout clusterdump -s output-kmeans/clusters-2 -d dict.txt -p output-kmeans/clusteredPoints -dt text -b 100 -n 10 > result.txt

When I run this with circa 1000 records (8000 distinct terms), the results are just perfect. I get exactly the clusters I want. The problems start when I try the same steps with a bit more data.

With 6000 records (28000 terms), or even half that, the process fails at the canopy step with a Java heap space OutOfMemoryError. The MAHOUT_HEAPSIZE variable on my local machine is set to 1024. I even tried running it on our development Hadoop cluster with approximately the same amount of memory, but it failed with the same error.

I realize that software needs a certain amount of memory to work properly, but I find it hard to believe that 1 GB is not enough to process a 3.1 MB file, which is the size of the vectors file produced by the second step. We're hoping to use this solution on hundreds of thousands of records, and I can't help but wonder what sort of hardware we'll need to process them if this kind of memory consumption is normal.

Am I missing something here? Are there any other settings I should be taking into consideration?

And one more thing: I tried the mean shift implementation and it seems to work fine with that much data.

Thanks.

Jure


RE: Canopy memory consumption

Posted by Jeff Eastman <je...@Narus.com>.
Canopy is a bit fussy about its T1 and T2 parameters: If you set T2 too small, you will get one cluster for each input vector; too large and you will get only one cluster for all vectors. T1 is less sensitive and will only impact how many points near each cluster are included in its centroid calculation.  My guess is you are in the first situation with T2 too small and, with the larger dataset, are creating more clusters than will fit into your memory.

How many clusters did you get from your small dataset? If the small set is a subset of the large set you could always run Canopy over the small set to get your k-means initial cluster centers, then run k-means iterations over the full dataset after. You can also skip the Canopy step entirely when using k-means: include a -k parameter and k-means will sample that many initial cluster centers from your data and then run its iterations. 
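
For example, something along these lines should work; the -k value here is just a placeholder for however many clusters you expect, and -c becomes the directory where the sampled initial centers are written:

mahout kmeans -i input/ -c initial-centers/ -k 100 -o output-kmeans/ -x 10 -cl -ow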

Glad to hear MeanShift is working for you. It has similar scaling limitations to Canopy. I've been pleasantly surprised by its performance on problems I thought were out of scope for it. Don't know why it works on your larger dataset when Canopy fails though.

-----Original Message-----
From: Jure Jeseničnik [mailto:Jure.Jesenicnik@planet9.si] 
Sent: Wednesday, November 17, 2010 3:54 AM
To: user@mahout.apache.org
Subject: Canopy memory consumption

Hi Guys.

What I'm trying to do is basic news clustering that will group news articles about the same topic into clusters. I have the data in a database, so I took the following approach:

1.       Wrote a small program that puts the data from the db into a Lucene Index.

2.       Created vectors from index with the following command:
mahout lucene.vector -d newsindex -f text -o input/out.txt -t dict.txt -i link -n 2

3.       Ran canopy, to get initial clusters:
mahout canopy -i input/ -o output-canopy/ -t1 1 -t2 1.4 -ow

4.       Ran the kmeans to perform the final clustering:
mahout kmeans -i input/ -o output-kmeans/ -c output-canopy/clusters-0 -x 10 -cl -ow

5.       Do the clusterdump to view results:
mahout clusterdump -s output-kmeans/clusters-2 -d dict.txt -p output-kmeans/clusteredPoints -dt text -b 100 -n 10 > result.txt

When I run this with circa 1000 records (8000 distinct terms), the results are just perfect. I get exactly the clusters I want. The problems start when I try the same steps with a bit more data.

With 6000 records (28000 terms), or even half that, the process fails at the canopy step with a Java heap space OutOfMemoryError. The MAHOUT_HEAPSIZE variable on my local machine is set to 1024. I even tried running it on our development Hadoop cluster with approximately the same amount of memory, but it failed with the same error.

I realize that software needs a certain amount of memory to work properly, but I find it hard to believe that 1 GB is not enough to process a 3.1 MB file, which is the size of the vectors file produced by the second step. We're hoping to use this solution on hundreds of thousands of records, and I can't help but wonder what sort of hardware we'll need to process them if this kind of memory consumption is normal.

Am I missing something here? Are there any other settings I should be taking into consideration?

And one more thing: I tried the mean shift implementation and it seems to work fine with that much data.

Thanks.

Jure