Posted to user@mahout.apache.org by Matt Molek <mp...@gmail.com> on 2012/10/20 21:44:37 UTC

clusterpp is only writing directories for about half of my clusters.

First off, thank you everyone for your help so far. This mailing list
has been a great help getting me up and running with Mahout.

Right now, I'm clustering a set of ~3M documents into 300 clusters.
Then I'm using clusterpp to split the documents up into directories
containing the vectors belonging to each cluster. After I perform the
clustering, clusterdump shows that each cluster has between ~800 and
~200,000 documents. This isn't a great spread, but the point is that
none of the clusters are empty.

Here are my commands:

bin/mahout kmeans -i ssvd2/USigma -c initial-centroids -o pca-clusters
-dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05
-k 300 -x 15 -cl -ow

bin/mahout clusterdump -i pca-clusters/clusters-11-final -o clusterdump.txt

bin/mahout clusterpp -i pca-clusters -o bottom


Since none of my clusters are empty, I would expect clusterpp to
create 300 directories in "bottom", one for each cluster. Instead,
only 147 directories are created. The other 153 outputs are just empty
part-r-* files sitting in the "bottom" directory.

I haven't found too much information when searching on this issue but
I did come across one mailing list post from a while back:
http://mail-archives.apache.org/mod_mbox/mahout-user/201202.mbox/%3C4F3E52FC.7000000@windwardsolutions.com%3E

In that discussion someone said, "If that is the only thing that is
contained in the part-r-* file [it had no vectors], then the reducer
responsible to write to that part-r-* file did not receive any input
records to write to it. This happens because the program uses the
default hash partitioner which sometimes maps records belonging to
different clusters to a same reducer; thus leaving some reducers
without any input records."
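
If that explanation is right, it is easy to sanity-check outside of
Hadoop with a toy simulation (a plain Java sketch, not Mahout code; the
cluster ids below are made up, but distinct): hash 300 distinct cluster
ids into 300 reducer buckets the way the default HashPartitioner does,
and count how many buckets receive anything. Even a perfectly uniform
hash leaves about 300/e =~ 110 reducers empty, the same ballpark as the
153 empty part files above.

    import java.util.HashSet;
    import java.util.Random;
    import java.util.Set;

    // Toy model of Hadoop's default HashPartitioner with one reducer
    // per cluster: k distinct keys into k buckets, then count the
    // buckets that actually receive a key.
    public class EmptyReducerCheck {
        public static void main(String[] args) {
            int k = 300;
            Random rnd = new Random(42);
            Set<Integer> ids = new HashSet<Integer>();
            while (ids.size() < k) {
                // made-up ids in the same range as the VL-374xxxx ids below
                ids.add(3_740_000 + rnd.nextInt(500_000));
            }
            Set<Integer> nonEmpty = new HashSet<Integer>();
            for (int id : ids) {
                // HashPartitioner: (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
                nonEmpty.add((Integer.valueOf(id).hashCode() & Integer.MAX_VALUE) % k);
            }
            System.out.println(nonEmpty.size() + " of " + k + " reducers received input");
            // Expected non-empty with a uniform hash: k * (1 - (1 - 1.0/k)^k) =~ 190.
        }
    }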

So if that's correct, is that what's happening to me? Half of my
clusters are being sent to the overlapping reducers? That seems like a
big issue, making clusterpp pretty much useless for my purposes. I
can't have documents randomly being sent to the wrong cluster's
directory, especially not 50+% of them.

One final detail: I'm not sure if this matters, but the clusters
output by kmeans are not numbered 1 to 300. They have an odd-looking,
nonsequential numbering scheme. The first 5 clusters are:
VL-3740844
VL-3741044
VL-3741140
VL-3741161
VL-3741235

I haven't done much with kmeans before, so I wasn't sure whether this
was unexpected behavior or not.

Re: clusterpp is only writing directories for about half of my clusters.

Posted by paritosh ranjan <pa...@gmail.com>.
The partitioner can be changed in the postProcessMR method of the
ClusterOutputPostProcessorDriver class.
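
For anyone who wants to try that, here is a rough sketch of what a
replacement could look like (untested; the IntWritable key type and the
"cluster.ids" option name are my assumptions, not actual Mahout API, so
check the job set up in postProcessMR for the real map output key). It
gives each known cluster id its own reducer instead of hashing:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.conf.Configurable;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Partitioner;
    import org.apache.mahout.math.VectorWritable;

    // Hypothetical replacement partitioner: maps each known cluster id
    // to its own reducer, assuming the driver sets "cluster.ids" to a
    // comma-separated list of final cluster ids and calls
    // setNumReduceTasks() with the list length.
    public class ClusterIdPartitioner extends Partitioner<IntWritable, VectorWritable>
            implements Configurable {

        private Configuration conf;
        private final Map<Integer, Integer> reducerForCluster = new HashMap<Integer, Integer>();

        @Override
        public void setConf(Configuration conf) {
            this.conf = conf;
            String[] ids = conf.getStrings("cluster.ids", new String[0]);
            for (int i = 0; i < ids.length; i++) {
                reducerForCluster.put(Integer.parseInt(ids[i]), i);
            }
        }

        @Override
        public Configuration getConf() {
            return conf;
        }

        @Override
        public int getPartition(IntWritable clusterId, VectorWritable value, int numReduceTasks) {
            Integer slot = reducerForCluster.get(clusterId.get());
            // Unknown ids fall back to the default hash behavior.
            return slot != null
                    ? slot % numReduceTasks
                    : (clusterId.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }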

Re: clusterpp is only writing directories for about half of my clusters.

Posted by Matt Molek <mp...@gmail.com>.
That's all very helpful. Thanks for your input!

Re: clusterpp is only writing directories for about half of my clusters.

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
PPS: Finally, if you decide to prototype in R with an exact-SVD
analogue of Mahout's SSVD and PCA: we prototyped them in R first,
before moving to the MR implementation, so you can use that code in
your prototype too if you want to make sure you see very similar
stochasticity effects. See the "R simulation" paragraph here:
https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition
to download the R prototype code of the single-threaded SSVD/PCA
versions of Mahout.

Hope that helps.

Re: clusterpp is only writing directories for about half of my clusters.

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Regardless of what you are trying to do, the best practice is actually
to prototype the process in R or MATLAB first, to make sure you are
getting results that make sense to you. Then, once you have figured out
what seems to be working, you can turn to large scale. SSVD is just
svd() in R, and I haven't used k-means or any other clustering there,
but I am sure it is available too.

Same goes for the sphere projections and PCA.
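
For reference, the sphere projection here just means scaling every row
to unit L2 norm before the decomposition; for unit vectors
||x - y||^2 = 2 - 2 cos(x, y), so L2-based k-means then respects cosine
geometry. A minimal sketch with Mahout's in-memory vectors (assuming
you stream your own rows through it):

    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    // Minimal sphere-projection sketch: scale a row to unit L2 norm
    // before feeding it to SSVD --pca / k-means.
    public class SphereProjectionSketch {
        public static void main(String[] args) {
            Vector row = new DenseVector(new double[] {3.0, 4.0});
            Vector unit = row.normalize(2.0); // divides by the L2 norm (5.0 here)
            System.out.println(unit);         // prints a vector ~ {0:0.6, 1:0.8}
        }
    }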




Re: clusterpp is only writing directories for about half of my clusters.

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I meant, "soft clustering".

Re: clusterpp is only writing directories for about half of my clusters.

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
from Jira:

> Hi Dmitriy, sorry for going a little off topic here, but could you elaborate on this? I've been experimenting with using either cosine or tanimoto distance on the USigma output of ssvd with -pca true. Are those not appropriate distance measures for the -pca output?

Let somebody correct me if I am talking nonsense here...

Strictly speaking, you can find clusters using L2 distance (i.e.
Euclidean distance). In that case, PCA helps you by reducing
dimensionality, and then the USigma output will preserve the original
distances (or at least their proportions). K-means with L2 will then
run a little faster.

But... with cosine and Tanimoto, PCA does not preserve those, due to
the recentering of the original data; therefore, dimensionality
reduction doesn't work as well for these types of things. Here you
basically have just two recourses: 1) do LSA (in terms of SSVD, that
means --pca false, taking the U output as the document-topic space), or
2) perhaps do the sphere projection first and then do dimensionality
reduction with --pca true; the latter will at least preserve cosine
distances as far as I can tell. But the standard way to address topical
"sort clustering" with text is still LSA. (If that's your goal, within
the Mahout realm I probably also need to draw your attention to the
LDA-CVB method in Mahout; various researchers say LDA actually does a
better job of finding topic mixtures.)
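
To make the distance claim concrete, here is the standard argument in
my own notation (a sketch; for the truncated/stochastic SVD it holds
only approximately). Write the thin SVD as A = U Sigma V^T, where V has
orthonormal columns, and let a_i and (U Sigma)_i denote rows:

    \[
      \|a_i - a_j\|_2
        = \big\| \big( (U\Sigma)_i - (U\Sigma)_j \big) V^\top \big\|_2
        = \| (U\Sigma)_i - (U\Sigma)_j \|_2 ,
    \]

since \( V^\top V = I \). Cosine, by contrast, is measured from the
origin, and PCA's recentering moves the origin. A tiny counterexample
with mean \( \mu = (1/2, 1/2) \):

    \[
      a_1 = (1, 0), \quad a_2 = (0, 1): \qquad
      \cos(a_1, a_2) = 0, \qquad \cos(a_1 - \mu,\, a_2 - \mu) = -1 .
    \]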


Re: clusterpp is only writing directories for about half of my clusters.

Posted by Matt Molek <mp...@gmail.com>.
I've done some more testing and submitted a JIRA:
https://issues.apache.org/jira/browse/MAHOUT-1103


Re: clusterpp is only writing directories for about half of my clusters.

Posted by Matt Molek <mp...@gmail.com>.
Thanks for the quick response!

I will do some testing tomorrow with various numbers of clusters and
create a JIRA once I have those results. I might be able to contribute
a patch for this if I have the time.


Re: clusterpp is only writing directories for about half of my clusters.

Posted by paritosh ranjan <pa...@gmail.com>.
"So if that's correct, is that what's happening to me? Half of my
clusters are being sent to the overlapping reducers? That seems like a
big issue, making clusterpp pretty much useless for my purposes. I
can't have documents randomly being sent to the wrong cluster's
directory, especially not 50+% of them."

This might be correct. I think this can occur if the number of clusters is
large, and the testing was not done with so many clusters.
Can you help a bit in testing some scenarios?

a) Try reducing the number of clusters to 100 and then 50. The aim is
to find the breaking point (number of clusters) after which clusters
start colliding on the same reducers. If this is found, then we can be
sure that the problem lies in the partitioner.
b) If you want, try a different partitioner (or several). The idea is
to create as many reducer tasks as there are (non-empty) clusters, so
that the vectors in each cluster land in a separate file and can later
be moved to their respective directories (named after the cluster id),
roughly as sketched below.
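
For (b), the driver-side wiring might look roughly like this (a sketch
only, in the Hadoop 2 style API; ClusterIdPartitioner is the
hypothetical partitioner sketched elsewhere in this thread, and
"cluster.ids" is an assumed configuration key, not an existing Mahout
option):

    import java.util.Arrays;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    // Sketch of idea (b): one reducer per non-empty cluster, plus a
    // partitioner that sends each cluster id to its own reducer.
    public class ClusterppRepartitionSketch {
        public static void main(String[] args) throws Exception {
            // In reality the ids would be read from the clusters-*-final
            // sequence files; these three are just ids quoted in this thread.
            List<String> clusterIds = Arrays.asList("3740844", "3741044", "3741140");

            Job job = Job.getInstance(new Configuration(), "clusterpp-one-reducer-per-cluster");
            job.getConfiguration().set("cluster.ids", String.join(",", clusterIds));
            job.setNumReduceTasks(clusterIds.size());
            job.setPartitionerClass(ClusterIdPartitioner.class);
            // ... input/output paths and mapper/reducer classes as in the
            // existing ClusterOutputPostProcessorDriver job ...
        }
    }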

Please also create a JIRA for this:
https://issues.apache.org/jira/browse/MAHOUT
And if you are interested, this would be a good starting point for
contributing to Mahout as well.


Re: clusterpp is only writing directories for about half of my clusters.

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
It depends on what you do. E.g. you may find it difficult to run SVD on
a single-node machine even for 2G worth of matrix input. Some of the
stuff is CPU bound; other stuff might be iteration bound (ALS), but
still worth a try if you can figure out good learning and
regularization rates. Some stuff (ALS, again) you may find actually
performs much better in a BSP environment than in MapReduce altogether.

So it depends on the problem, but as a general rule of thumb: if you
can't solve your problem on a single node in an hour (depending on your
requirements), that's probably when you might want to start trying a
machine-cluster solution.

I'd always suggest prototyping in R first, if not for the volume's sake
then just to make sure it makes sense for your data. If it looks like
it takes forever at your size, then you may want to start looking
elsewhere.


Re: clusterpp is only writing directories for about half of my clusters.

Posted by Eric Link <er...@ericmlink.com>.
We are looking at using Mahout in our organization. We need to do
statistical analysis, clustering, and recommendations. What is the
"sweet spot" for doing this with Mahout? Meaning, what types of data
sets and data volumes are the best fit for a tool like Mahout, versus
doing things, say, in a SQL database? I hear big data doesn't really
start until you have terabytes and petabytes of data, so I'm not sure
the data sets I have are worthy! Thanks for any thoughts on the proper
fit for a tool like Mahout. - Eric


