Posted to dev@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2009/06/26 17:32:55 UTC

KMeansJob vs KMeansDriver

Isn't the KMeansJob pretty much redundant, assuming we add a parameter  
to KMeansDriver to take in the number of reduce tasks?
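
(To make that concrete: the parameter would just be threaded through to the
job setup. A sketch, where the wrapper method name is invented and only
JobConf#setNumReduceTasks is real Hadoop API:

import org.apache.hadoop.mapred.JobConf;

public final class KMeansDriverSketch {

  private KMeansDriverSketch() {
  }

  // Hypothetical: the proposed numReduceTasks parameter, passed straight
  // through to the Hadoop job configuration for each k-Means iteration.
  public static JobConf configureIteration(JobConf conf, int numReduceTasks) {
    conf.setNumReduceTasks(numReduceTasks);
    return conf;
  }
}
)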

Also, doesn't the variable naming in KMeansJob imply that the number of
reduce tasks (numCentroids) is actually the "k" in k-Means, even though
this value is currently fixed at 2 when using KMeansDriver?  I'm trying
to make arg handling easier for MAHOUT-138.

I'm confused.

-Grant

Re: KMeansJob vs KMeansDriver

Posted by Grant Ingersoll <gs...@apache.org>.
On Jun 26, 2009, at 11:32 AM, Grant Ingersoll wrote:

> Isn't the KMeansJob pretty much redundant, assuming we add a  
> parameter to KMeansDriver to take in the number of reduce tasks?
>
> Also, doesn't the variable naming in KMeansJob imply that the number
> of reduce tasks (numCentroids) is actually the "k" in k-Means, even
> though this value is currently fixed at 2 when using KMeansDriver?
> I'm trying to make arg handling easier for MAHOUT-138.
>

It also deletes the OutputPath if it exists.

I'm going to delete the Job file and fold this functionality into  
KMDriver.
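
(The delete-if-exists piece is only a few lines with the Hadoop FileSystem
API. A sketch of what gets folded in, with the class and method names
invented:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class OutputPathSketch {

  private OutputPathSketch() {
  }

  // Recursively delete the output path if a previous run left one behind.
  public static void deleteIfExists(Configuration conf, Path output)
      throws IOException {
    FileSystem fs = FileSystem.get(output.toUri(), conf);
    if (fs.exists(output)) {
      fs.delete(output, true);
    }
  }
}
)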

-Grant

Re: KMeansJob vs KMeansDriver

Posted by Ted Dunning <te...@gmail.com>.
Of course, this should support assigning *any* input to clusters, not just
the original input.
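
Concretely, that argues for an assignment step whose input path is
independent of the path the clusters were computed from. Something like
this hypothetical signature (all names invented, body elided):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public final class ClusterPointsSketch {

  private ClusterPointsSketch() {
  }

  // pointsToAssign need not be the data the clusters were trained on:
  // any vectors can be classified against the final cluster state.
  public static void assignPoints(JobConf conf,
                                  Path pointsToAssign,
                                  Path clusterState,
                                  Path output) {
    // ... configure a map-only job that loads clusterState, computes the
    // closest (or most probable) cluster for each input point, and
    // writes the assignments to output ...
  }
}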

On Fri, Jun 26, 2009 at 9:32 AM, Jeff Eastman <jd...@windwardsolutions.com> wrote:

> 2. Optionally cluster the input data points by assigning them to clusters.
> This would be with probabilities in the case of FuzzyKMeans and Dirichlet or
> one might just desire the most likely cluster.

Re: KMeansJob vs KMeansDriver

Posted by Grant Ingersoll <gs...@apache.org>.
On Jun 26, 2009, at 12:32 PM, Jeff Eastman wrote:

> Grant Ingersoll wrote:
>> Isn't the KMeansJob pretty much redundant, assuming we add a  
>> parameter to KMeansDriver to take in the number of reduce tasks?
> The purpose of the clustering jobs, in general, was to simplify  
> computing the clusters and then clustering the data. It has been  
> applied - and changed - inconsistently over the various  
> implementations and some cleanup is warranted. It seems to me that  
> having a job to do both steps is still valuable, though (as in the  
> earlier kmeans synthetic control example) it may do the point  
> clustering unnecessarily if it is blindly used as the only entry point.

OK, but in this case, the KMJob actually takes in more parameters, not  
fewer.  BTW, this is not the same Job as the one used by the synthetic  
control example, which I agree is more usable.

>
> I don't currently see how specifying the 'k' value explicitly can  
> work in the current job and it is unrelated to the number of  
> reducers. The 'k' value comes from the initial number of clusters. I  
> think the implementation can use any number of reducers up to 'k'  
> but don't recall seeing a test for that. One could add a job step  
> that picks 'k' random centers from the data - as in your previous  
> threads - and that job/driver would need to know 'k'. See below.

I have added the Random capability (locally).  I'll put up my patch  
shortly.
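
(The patch itself is the authoritative version; purely as illustration,
picking 'k' random seed points in a single pass over the data is a classic
reservoir sample. All names here are invented:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public final class RandomSeedSketch {

  private RandomSeedSketch() {
  }

  // Reservoir sampling: one pass, uniform random choice of k items.
  public static <T> List<T> chooseK(Iterable<T> points, int k, Random rng) {
    List<T> reservoir = new ArrayList<T>(k);
    int seen = 0;
    for (T point : points) {
      if (reservoir.size() < k) {
        reservoir.add(point);
      } else {
        int j = rng.nextInt(seen + 1); // uniform in [0, seen]
        if (j < k) {
          reservoir.set(j, point);     // keep with probability k/(seen+1)
        }
      }
      seen++;
    }
    return reservoir;
  }
}
)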

>
> For consistency, it seems to me that all the clustering jobs should  
> uniformly facilitate these actions:
> 0. Set the initial clustering state
> 1. Compute a set of clusters given the input data points and the  
> initial clustering state
> 2. Optionally cluster the input data points by assigning them to  
> clusters. This would be with probabilities in the case of  
> FuzzyKMeans and Dirichlet or one might just desire the most likely  
> cluster.
>
> Canopy has no initial clustering state. For KMeans, this can be  
> computed via running Canopy on the data or by selecting 'k' random  
> points from the data, or by some other heuristic (un)related to the  
> data. For Dirichlet, it is by sampling from the prior of the  
> ModelDistribution; for MeanShift every input data point creates an  
> initial canopy.
>
> (The various jobs, drivers and output directory structures produced  
> by the different algorithms need to be cleaned up and made more  
> consistent, IMO)
>>
>> Also, doesn't the variable naming in KMeansJob imply that the number
>> of reduce tasks (numCentroids) is actually the "k" in k-Means, even
>> though this value is currently fixed at 2 when using KMeansDriver?
>> I'm trying to make arg handling easier for MAHOUT-138.
> I thought I had already committed a change to rename this argument  
> numReduceTasks so as to be consistent with its application in  
> KMeansDriver.

In KMD it's called numReduceTasks; in KMJ it's called numCentroids,  
which is what threw me.

I think with the new command line handling approach, the input  
parameters become much more descriptive, which will make it easier for  
people to consume the various drivers.
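
(For anyone who hasn't seen CLI2, the flavor is roughly the following
sketch, built on the commons-cli2 builder classes; the option shown is
just an example:

import org.apache.commons.cli2.CommandLine;
import org.apache.commons.cli2.Group;
import org.apache.commons.cli2.Option;
import org.apache.commons.cli2.builder.ArgumentBuilder;
import org.apache.commons.cli2.builder.DefaultOptionBuilder;
import org.apache.commons.cli2.builder.GroupBuilder;
import org.apache.commons.cli2.commandline.Parser;

public final class Cli2Sketch {

  private Cli2Sketch() {
  }

  public static void main(String[] args) throws Exception {
    DefaultOptionBuilder obuilder = new DefaultOptionBuilder();
    ArgumentBuilder abuilder = new ArgumentBuilder();

    // A descriptive, self-documenting option instead of a bare positional arg.
    Option reduceOpt = obuilder.withLongName("numReduceTasks")
        .withShortName("r")
        .withArgument(abuilder.withName("numReduceTasks")
            .withMinimum(1).withMaximum(1).create())
        .withDescription("The number of reduce tasks to launch")
        .create();

    Group group = new GroupBuilder().withName("Options")
        .withOption(reduceOpt).create();
    Parser parser = new Parser();
    parser.setGroup(group);
    CommandLine cmdLine = parser.parse(args);
    if (cmdLine.hasOption(reduceOpt)) {
      int numReduceTasks =
          Integer.parseInt(cmdLine.getValue(reduceOpt).toString());
      System.out.println("numReduceTasks = " + numReduceTasks);
    }
  }
}
)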

-Grant

Re: KMeansJob vs KMeansDriver

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
I didn't notice the --clusters option when just reading the patch. If that 
puts the clusters into a specific directory then fine. I was suggesting 
the default be $output/state rather than the current behavior of just 
writing them all to $output.
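
(In code the default would be trivial to derive from the output path; a
sketch, with names invented:

import org.apache.hadoop.fs.Path;

public final class StatePathSketch {

  private StatePathSketch() {
  }

  // Default the per-iteration cluster files to $output/state/clusters-N
  // unless --clusters overrides it, keeping $output itself uncluttered.
  public static Path defaultClustersPath(Path output, int iteration) {
    return new Path(new Path(output, "state"), "clusters-" + iteration);
  }
}
)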

If you want some help, I'm available a little before next week and more 
after that. How about you do Canopy and KMeans and I do the others, since 
those seem to be on your critical path at the moment.

Jeff

Grant Ingersoll wrote:
>
> On Jun 26, 2009, at 3:04 PM, Jeff Eastman wrote:
>
>> That looks reasonable, just from reading the patch. You might also want to 
>> put the clusters-x files into a state (or clusters) sub-directory to 
>> reduce noise in the output directory and improve consistency with MS 
>> and Dirichlet (which do not themselves agree on which directory name 
>> to use).
>
> The --clusters option allows the user to specify the path.  Or, are you 
> suggesting there be a default of $output/clusters/?
>
> For M-138, I'd like to convert all the drivers over to use CLI2 (help 
> appreciated!)
>
>


Re: KMeansJob vs KMeansDriver

Posted by Grant Ingersoll <gs...@apache.org>.
On Jun 26, 2009, at 3:04 PM, Jeff Eastman wrote:

> That looks reasonable, just from reading the patch. You might also want  
> to put the clusters-x files into a state (or clusters) sub-directory  
> to reduce noise in the output directory and improve consistency with  
> MS and Dirichlet (which do not themselves agree on which directory  
> name to use).

The --clusters option allows the user to specify the path.  Or, are you  
suggesting there be a default of $output/clusters/?

For M-138, I'd like to convert all the drivers over to use CLI2 (help  
appreciated!)

Re: KMeansJob vs KMeansDriver

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
That looks reasonable, just from reading the patch. You might also want to 
put the clusters-x files into a state (or clusters) sub-directory to 
reduce noise in the output directory and improve consistency with MS and 
Dirichlet (which do not themselves agree on which directory name to use).


Grant Ingersoll wrote:
> Check out the patch I just put up on M-138
>
> On Jun 26, 2009, at 12:32 PM, Jeff Eastman wrote:
>
>> Grant Ingersoll wrote:
>>> Isn't the KMeansJob pretty much redundant, assuming we add a 
>>> parameter to KMeansDriver to take in the number of reduce tasks?
>> The purpose of the clustering jobs, in general, was to simplify 
>> computing the clusters and then clustering the data. It has been 
>> applied - and changed - inconsistently over the various 
>> implementations and some cleanup is warranted. It seems to me that 
>> having a job to do both steps is still valuable, though (as in the 
>> earlier kmeans synthetic control example) it may do the point 
>> clustering unnecessarily if it is blindly used as the only entry point.
>>
>> I don't currently see how specifying the 'k' value explicitly can 
>> work in the current job and it is unrelated to the number of 
>> reducers. The 'k' value comes from the initial number of clusters. I 
>> think the implementation can use any number of reducers up to 'k' but 
>> don't recall seeing a test for that. One could add a job step that 
>> picks 'k' random centers from the data - as in your previous threads 
>> - and that job/driver would need to know 'k'. See below.
>>
>> For consistency, it seems to me that all the clustering jobs should 
>> uniformly facilitate these actions:
>> 0. Set the initial clustering state
>> 1. Compute a set of clusters given the input data points and the 
>> initial clustering state
>> 2. Optionally cluster the input data points by assigning them to 
>> clusters. This would be with probabilities in the case of FuzzyKMeans 
>> and Dirichlet or one might just desire the most likely cluster.
>>
>> Canopy has no initial clustering state. For KMeans, this can be 
>> computed via running Canopy on the data or by selecting 'k' random 
>> points from the data, or by some other heuristic (un)related to the 
>> data. For Dirichlet, it is by sampling from the prior of the 
>> ModelDistribution; for MeanShift every input data point creates an 
>> initial canopy.
>>
>> (The various jobs, drivers and output directory structures produced 
>> by the different algorithms need to be cleaned up and made more 
>> consistent, IMO)
>>>
>>> Also, doesn't the variable naming in KMeansJob imply that the number
>>> of reduce tasks (numCentroids) is actually the "k" in k-Means, even
>>> though this value is currently fixed at 2 when using KMeansDriver?
>>> I'm trying to make arg handling easier for MAHOUT-138.
>> I thought I had already committed a change to rename this argument 
>> numReduceTasks so as to be consistent with its application in 
>> KMeansDriver.
>>
>> Jeff
>
>
>
>


Re: KMeansJob vs KMeansDriver

Posted by Grant Ingersoll <gs...@apache.org>.
Check out the patch I just put up on M-138

On Jun 26, 2009, at 12:32 PM, Jeff Eastman wrote:

> Grant Ingersoll wrote:
>> Isn't the KMeansJob pretty much redundant, assuming we add a  
>> parameter to KMeansDriver to take in the number of reduce tasks?
> The purpose of the clustering jobs, in general, was to simplify  
> computing the clusters and then clustering the data. It has been  
> applied - and changed - inconsistently over the various  
> implementations and some cleanup is warranted. It seems to me that  
> having a job to do both steps is still valuable, though (as in the  
> earlier kmeans synthetic control example) it may do the point  
> clustering unnecessarily if it is blindly used as the only entry point.
>
> I don't currently see how specifying the 'k' value explicitly can  
> work in the current job and it is unrelated to the number of  
> reducers. The 'k' value comes from the initial number of clusters. I  
> think the implementation can use any number of reducers up to 'k'  
> but don't recall seeing a test for that. One could add a job step  
> that picks 'k' random centers from the data - as in your previous  
> threads - and that job/driver would need to know 'k'. See below.
>
> For consistency, it seems to me that all the clustering jobs should  
> uniformly facilitate these actions:
> 0. Set the initial clustering state
> 1. Compute a set of clusters given the input data points and the  
> initial clustering state
> 2. Optionally cluster the input data points by assigning them to  
> clusters. This would be with probabilities in the case of  
> FuzzyKMeans and Dirichlet or one might just desire the most likely  
> cluster.
>
> Canopy has no initial clustering state. For KMeans, this can be  
> computed via running Canopy on the data or by selecting 'k' random  
> points from the data, or by some other heuristic (un)related to the  
> data. For Dirichlet, it is by sampling from the prior of the  
> ModelDistribution; for MeanShift every input data point creates an  
> initial canopy.
>
> (The various jobs, drivers and output directory structures produced  
> by the different algorithms need to be cleaned up and made more  
> consistent, IMO)
>>
>> Also, doesn't the variable naming in KMeansJob imply that the number
>> of reduce tasks (numCentroids) is actually the "k" in k-Means, even
>> though this value is currently fixed at 2 when using KMeansDriver?
>> I'm trying to make arg handling easier for MAHOUT-138.
> I thought I had already committed a change to rename this argument  
> numReduceTasks so as to be consistent with its application in  
> KMeansDriver.
>
> Jeff



Re: KMeansJob vs KMeansDriver

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Grant Ingersoll wrote:
> Isn't the KMeansJob pretty much redundant, assuming we add a parameter 
> to KMeansDriver to take in the number of reduce tasks?
The purpose of the clustering jobs, in general, was to simplify 
computing the clusters and then clustering the data. It has been applied 
- and changed - inconsistently over the various implementations and some 
cleanup is warranted. It seems to me that having a job to do both steps 
is still valuable, though (as in the earlier kmeans synthetic control 
example) it may do the point clustering unnecessarily if it is blindly 
used as the only entry point.

I don't currently see how specifying the 'k' value explicitly can work 
in the current job and it is unrelated to the number of reducers. The 
'k' value comes from the initial number of clusters. I think the 
implementation can use any number of reducers up to 'k' but don't recall 
seeing a test for that. One could add a job step that picks 'k' random 
centers from the data - as in your previous threads - and that 
job/driver would need to know 'k'. See below.

For consistency, it seems to me that all the clustering jobs should 
uniformly facilitate these actions:
0. Set the initial clustering state
1. Compute a set of clusters given the input data points and the initial 
clustering state
2. Optionally cluster the input data points by assigning them to 
clusters. This would be with probabilities in the case of FuzzyKMeans 
and Dirichlet or one might just desire the most likely cluster.

Canopy has no initial clustering state. For KMeans, this can be computed 
via running Canopy on the data or by selecting 'k' random points from 
the data, or by some other heuristic (un)related to the data. For 
Dirichlet, it is by sampling from the prior of the ModelDistribution; 
for MeanShift every input data point creates an initial canopy.
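
One way to read that uniform shape, purely as an illustration (no such
interface exists in the code today, and every name below is invented):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public interface ClusteringJobSketch {

  // 0. Produce the initial clustering state: 'k' random points, canopies,
  //    samples from the model prior, or one canopy per input point.
  Path initializeState(JobConf conf, Path input, Path stateDir);

  // 1. Iterate from the initial state to a converged set of clusters.
  Path computeClusters(JobConf conf, Path input, Path stateDir);

  // 2. Optionally assign the input points to the computed clusters, with
  //    probabilities where the algorithm (FuzzyKMeans, Dirichlet) has them.
  void clusterPoints(JobConf conf, Path input, Path clusters, Path output);
}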

(The various jobs, drivers and output directory structures produced by 
the different algorithms need to be cleaned up and made more consistent, 
IMO)
>
> Also, doesn't the variable naming in KMeansJob imply that the number
> of reduce tasks (numCentroids) is actually the "k" in k-Means, even
> though this value is currently fixed at 2 when using KMeansDriver?
> I'm trying to make arg handling easier for MAHOUT-138.
I thought I had already committed a change to rename this argument 
numReduceTasks so as to be consistent with its application in KMeansDriver.

Jeff