Posted to dev@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2009/06/26 17:32:55 UTC
KMeansJob vs KMeansDriver
Isn't the KMeansJob pretty much redundant, assuming we add a parameter
to KMeansDriver to take in the number of reduce tasks?
Also, doesn't the variable naming in KMeansJob imply that the number
of reduce tasks (numCentroids) is actually the "k" in k-Means, even
though this value is currently fixed at 2 when using KMeansDriver? I'm
trying to make arg handling easier for MAHOUT-138.
I'm confused.
-Grant
Re: KMeansJob vs KMeansDriver
Posted by Grant Ingersoll <gs...@apache.org>.
On Jun 26, 2009, at 11:32 AM, Grant Ingersoll wrote:
> Isn't the KMeansJob pretty much redundant, assuming we add a
> parameter to KMeansDriver to take in the number of reduce tasks?
>
> Also, doesn't the variable naming in KMeansJob imply that the number
> of reduce tasks (numCentroids) is actually the "k" in k-Means, even
> though this value is currently fixed at 2 when using KMeansDriver? I'm
> trying to make arg handling easier for MAHOUT-138.
>
It also deletes the OutputPath if it exists.
I'm going to delete the Job file and fold this functionality into
KMDriver.
-Grant
Re: KMeansJob vs KMeansDriver
Posted by Ted Dunning <te...@gmail.com>.
Of course, this should support assigning *any* input to clusters, not just
the original input.
On Fri, Jun 26, 2009 at 9:32 AM, Jeff Eastman <jd...@windwardsolutions.com> wrote:
> 2. Optionally cluster the input data points by assigning them to clusters.
> This would be with probabilities in the case of FuzzyKMeans and Dirichlet or
> one might just desire the most likely cluster.
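The distinction Jeff draws between probabilistic assignment (FuzzyKMeans, Dirichlet) and picking the most likely cluster can be sketched as follows. This is an illustrative sketch only, not Mahout's actual implementation; the function names and the fuzziness parameter `m` are assumptions.

```python
# Illustrative sketch of step 2 (assigning points to clusters).
# Not Mahout's FuzzyKMeans code; names and parameters are hypothetical.
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hard_assign(point, centers):
    """Most-likely cluster: index of the nearest center."""
    return min(range(len(centers)), key=lambda i: dist(point, centers[i]))

def fuzzy_assign(point, centers, m=2.0):
    """Fuzzy k-means style memberships: one probability per cluster, summing to 1."""
    d = [max(dist(point, c), 1e-12) for c in centers]  # guard against /0
    exp = 2.0 / (m - 1.0)
    return [1.0 / sum((d[i] / d[j]) ** exp for j in range(len(centers)))
            for i in range(len(centers))]
```

With two centers at (0,0) and (10,0), a point at (1,0) gets a hard assignment of cluster 0 and fuzzy memberships heavily weighted toward it, while every point still carries a nonzero probability for each cluster.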
Re: KMeansJob vs KMeansDriver
Posted by Grant Ingersoll <gs...@apache.org>.
On Jun 26, 2009, at 12:32 PM, Jeff Eastman wrote:
> Grant Ingersoll wrote:
>> Isn't the KMeansJob pretty much redundant, assuming we add a
>> parameter to KMeansDriver to take in the number of reduce tasks?
> The purpose of the clustering jobs, in general, was to simplify
> computing the clusters and then clustering the data. It has been
> applied - and changed - inconsistently over the various
> implementations and some cleanup is warranted. It seems to me that
> having a job to do both steps is still valuable, though (as in the
> earlier kmeans synthetic control example) it may do the point
> clustering unnecessarily if it is blindly used as the only entry point.
OK, but in this case, the KMJob actually takes in more parameters, not
fewer. BTW, this is not the same Job as the one used by the synthetic
control example, which I agree is more usable.
>
> I don't currently see how specifying the 'k' value explicitly can
> work in the current job and it is unrelated to the number of
> reducers. The 'k' value comes from the initial number of clusters. I
> think the implementation can use any number of reducers up to 'k'
> but don't recall seeing a test for that. One could add a job step
> that picks 'k' random centers from the data - as in your previous
> threads - and that job/driver would need to know 'k'. See below.
I have added the Random capability (locally). I'll put up my patch
shortly.
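Picking 'k' random centers from the input data amounts to something like the following. This is an illustrative sketch, not the actual patch; `random_seed_centers` is a hypothetical name.

```python
# Sketch of seeding k-means with k random points from the input.
# Illustrative only; not the code in the MAHOUT-138 patch.
import random

def random_seed_centers(points, k, seed=None):
    """Pick k distinct input points to serve as the initial cluster centers."""
    if k > len(points):
        raise ValueError("k cannot exceed the number of input points")
    rng = random.Random(seed)  # seed for reproducible runs
    return rng.sample(points, k)  # sampling without replacement
```

Since the centers come from the data itself, every initial cluster is guaranteed to be non-empty on its own point, and a fixed seed makes test runs repeatable.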
>
> For consistency, it seems to me that all the clustering jobs should
> uniformly facilitate these actions:
> 0. Set the initial clustering state
> 1. Compute a set of clusters given the input data points and the
> initial clustering state
> 2. Optionally cluster the input data points by assigning them to
> clusters. This would be with probabilities in the case of
> FuzzyKMeans and Dirichlet or one might just desire the most likely
> cluster.
>
> Canopy has no initial clustering state. For KMeans, this can be
> computed via running Canopy on the data or by selecting 'k' random
> points from the data, or by some other heuristic (un)related to the
> data. For Dirichlet, it is by sampling from the prior of the
> ModelDistribution; for MeanShift every input data point creates an
> initial canopy.
>
> (The various jobs, drivers and output directory structures produced
> by the different algorithms need to be cleaned up and made more
> consistent, IMO)
>>
>> Also, doesn't the variable naming in KMeansJob imply that the number
>> of reduce tasks (numCentroids) is actually the "k" in k-Means, even
>> though this value is currently fixed at 2 when using KMeansDriver?
>> I'm trying to make arg handling easier for MAHOUT-138.
> I thought I had already committed a change to rename this argument
> numReduceTasks so as to be consistent with its application in
> KMeansDriver.
In KMD it's called numReduceTasks, in KMJ it's called numCentroids,
which is what threw me.
I think with the new command line handling approach, the input
parameters become much more descriptive, which will make it easier for
people to consume the various drivers.
-Grant
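The three uniform steps Jeff lists (set the initial state, compute the clusters, optionally assign the points) could be sketched in miniature as below. This is an illustrative in-memory sketch, not Mahout's MapReduce implementation; all function names are hypothetical.

```python
# Toy in-memory version of the uniform clustering workflow:
# 0) set initial state, 1) compute clusters, 2) assign points.
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def initial_state(points, k, seed=0):
    # Step 0: here, k random points; Canopy or another heuristic would also do.
    return random.Random(seed).sample(points, k)

def compute_clusters(points, centers, iterations=10):
    # Step 1: standard Lloyd iterations over the input data.
    for _ in range(iterations):
        buckets = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            buckets[i].append(p)
        # Recompute each center as the mean of its bucket; keep it if empty.
        centers = [tuple(sum(xs) / len(b) for xs in zip(*b)) if b else c
                   for b, c in zip(buckets, centers)]
    return centers

def assign(points, centers):
    # Step 2: optional final clustering of the input (hard assignment here;
    # FuzzyKMeans/Dirichlet would emit probabilities instead).
    return [min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            for p in points]
```

Separating the three steps this way is exactly what lets a driver skip step 2 when the caller only wants the cluster centers.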
Re: KMeansJob vs KMeansDriver
Posted by Jeff Eastman <jd...@windwardsolutions.com>.
I didn't notice the --clusters option when just reading the patch. If
that puts the clusters into a specific directory then fine. I was
suggesting the default be $output/state rather than, as currently, just
writing them all to $output.
If you want some help, I'm available a little before next week and more
after that. How about you do Canopy and KMeans and I do the others,
since those seem to be in your critical path at the moment?
Jeff
Grant Ingersoll wrote:
>
> On Jun 26, 2009, at 3:04 PM, Jeff Eastman wrote:
>
>> That looks reasonable, just reading the patch. You might also want to
>> put the clusters-x files into a state (or clusters) sub-directory to
>> reduce noise in the output directory and improve consistency with MS
>> and Dirichlet (which do not themselves agree on which directory name
>> to use).
>
> The --clusters option allows users to specify the path. Or, are you
> suggesting there be a default of $output/clusters/?
>
> For M-138, I'd like to convert all the drivers over to use CLI2
> (help appreciated!)
>
>
Re: KMeansJob vs KMeansDriver
Posted by Grant Ingersoll <gs...@apache.org>.
On Jun 26, 2009, at 3:04 PM, Jeff Eastman wrote:
> That looks reasonable, just reading the patch. You might also want
> to put the clusters-x files into a state (or clusters) sub-directory
> to reduce noise in the output directory and improve consistency with
> MS and Dirichlet (which do not themselves agree on which directory
> name to use).
The --clusters option allows users to specify the path. Or, are you
suggesting there be a default of $output/clusters/?
For M-138, I'd like to convert all the drivers over to use CLI2 (help
appreciated!)
Re: KMeansJob vs KMeansDriver
Posted by Jeff Eastman <jd...@windwardsolutions.com>.
That looks reasonable, just reading the patch. You might also want to
put the clusters-x files into a state (or clusters) sub-directory to
reduce noise in the output directory and improve consistency with MS and
Dirichlet (which do not themselves agree on which directory name to use).
Grant Ingersoll wrote:
> Check out the patch I just put up on M-138
>
> On Jun 26, 2009, at 12:32 PM, Jeff Eastman wrote:
>
>> Grant Ingersoll wrote:
>>> Isn't the KMeansJob pretty much redundant, assuming we add a
>>> parameter to KMeansDriver to take in the number of reduce tasks?
>> The purpose of the clustering jobs, in general, was to simplify
>> computing the clusters and then clustering the data. It has been
>> applied - and changed - inconsistently over the various
>> implementations and some cleanup is warranted. It seems to me that
>> having a job to do both steps is still valuable, though (as in the
>> earlier kmeans synthetic control example) it may do the point
>> clustering unnecessarily if it is blindly used as the only entry point.
>>
>> I don't currently see how specifying the 'k' value explicitly can
>> work in the current job and it is unrelated to the number of
>> reducers. The 'k' value comes from the initial number of clusters. I
>> think the implementation can use any number of reducers up to 'k' but
>> don't recall seeing a test for that. One could add a job step that
>> picks 'k' random centers from the data - as in your previous threads
>> - and that job/driver would need to know 'k'. See below.
>>
>> For consistency, it seems to me that all the clustering jobs should
>> uniformly facilitate these actions:
>> 0. Set the initial clustering state
>> 1. Compute a set of clusters given the input data points and the
>> initial clustering state
>> 2. Optionally cluster the input data points by assigning them to
>> clusters. This would be with probabilities in the case of FuzzyKMeans
>> and Dirichlet or one might just desire the most likely cluster.
>>
>> Canopy has no initial clustering state. For KMeans, this can be
>> computed via running Canopy on the data or by selecting 'k' random
>> points from the data, or by some other heuristic (un)related to the
>> data. For Dirichlet, it is by sampling from the prior of the
>> ModelDistribution; for MeanShift every input data point creates an
>> initial canopy.
>>
>> (The various jobs, drivers and output directory structures produced
>> by the different algorithms need to be cleaned up and made more
>> consistent, IMO)
>>>
>>> Also, doesn't the variable naming in KMeansJob imply that the number
>>> of reduce tasks (numCentroids) is actually the "k" in k-Means, even
>>> though this value is currently fixed at 2 when using KMeansDriver?
>>> I'm trying to make arg handling easier for MAHOUT-138.
>> I thought I had already committed a change to rename this argument
>> numReduceTasks so as to be consistent with its application in
>> KMeansDriver.
>>
>> Jeff
>
>
>
>
Re: KMeansJob vs KMeansDriver
Posted by Grant Ingersoll <gs...@apache.org>.
Check out the patch I just put up on M-138
On Jun 26, 2009, at 12:32 PM, Jeff Eastman wrote:
> Grant Ingersoll wrote:
>> Isn't the KMeansJob pretty much redundant, assuming we add a
>> parameter to KMeansDriver to take in the number of reduce tasks?
> The purpose of the clustering jobs, in general, was to simplify
> computing the clusters and then clustering the data. It has been
> applied - and changed - inconsistently over the various
> implementations and some cleanup is warranted. It seems to me that
> having a job to do both steps is still valuable, though (as in the
> earlier kmeans synthetic control example) it may do the point
> clustering unnecessarily if it is blindly used as the only entry point.
>
> I don't currently see how specifying the 'k' value explicitly can
> work in the current job and it is unrelated to the number of
> reducers. The 'k' value comes from the initial number of clusters. I
> think the implementation can use any number of reducers up to 'k'
> but don't recall seeing a test for that. One could add a job step
> that picks 'k' random centers from the data - as in your previous
> threads - and that job/driver would need to know 'k'. See below.
>
> For consistency, it seems to me that all the clustering jobs should
> uniformly facilitate these actions:
> 0. Set the initial clustering state
> 1. Compute a set of clusters given the input data points and the
> initial clustering state
> 2. Optionally cluster the input data points by assigning them to
> clusters. This would be with probabilities in the case of
> FuzzyKMeans and Dirichlet or one might just desire the most likely
> cluster.
>
> Canopy has no initial clustering state. For KMeans, this can be
> computed via running Canopy on the data or by selecting 'k' random
> points from the data, or by some other heuristic (un)related to the
> data. For Dirichlet, it is by sampling from the prior of the
> ModelDistribution; for MeanShift every input data point creates an
> initial canopy.
>
> (The various jobs, drivers and output directory structures produced
> by the different algorithms need to be cleaned up and made more
> consistent, IMO)
>>
>> Also, doesn't the variable naming in KMeansJob imply that the number
>> of reduce tasks (numCentroids) is actually the "k" in k-Means, even
>> though this value is currently fixed at 2 when using KMeansDriver?
>> I'm trying to make arg handling easier for MAHOUT-138.
> I thought I had already committed a change to rename this argument
> numReduceTasks so as to be consistent with its application in
> KMeansDriver.
>
> Jeff
Re: KMeansJob vs KMeansDriver
Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Grant Ingersoll wrote:
> Isn't the KMeansJob pretty much redundant, assuming we add a parameter
> to KMeansDriver to take in the number of reduce tasks?
The purpose of the clustering jobs, in general, was to simplify
computing the clusters and then clustering the data. It has been applied
- and changed - inconsistently over the various implementations and some
cleanup is warranted. It seems to me that having a job to do both steps
is still valuable, though (as in the earlier kmeans synthetic control
example) it may do the point clustering unnecessarily if it is blindly
used as the only entry point.
I don't currently see how specifying the 'k' value explicitly can work
in the current job and it is unrelated to the number of reducers. The
'k' value comes from the initial number of clusters. I think the
implementation can use any number of reducers up to 'k' but don't recall
seeing a test for that. One could add a job step that picks 'k' random
centers from the data - as in your previous threads - and that
job/driver would need to know 'k'. See below.
For consistency, it seems to me that all the clustering jobs should
uniformly facilitate these actions:
0. Set the initial clustering state
1. Compute a set of clusters given the input data points and the initial
clustering state
2. Optionally cluster the input data points by assigning them to
clusters. This would be with probabilities in the case of FuzzyKMeans
and Dirichlet or one might just desire the most likely cluster.
Canopy has no initial clustering state. For KMeans, this can be computed
via running Canopy on the data or by selecting 'k' random points from
the data, or by some other heuristic (un)related to the data. For
Dirichlet, it is by sampling from the prior of the ModelDistribution;
for MeanShift every input data point creates an initial canopy.
(The various jobs, drivers and output directory structures produced by
the different algorithms need to be cleaned up and made more consistent,
IMO)
>
> Also, doesn't the variable naming in KMeansJob imply that the number
> of reduce tasks (numCentroids) is actually the "k" in k-Means, even
> though this value is currently fixed at 2 when using KMeansDriver?
> I'm trying to make arg handling easier for MAHOUT-138.
I thought I had already committed a change to rename this argument
numReduceTasks so as to be consistent with its application in KMeansDriver.
Jeff
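Jeff's point that the number of reducers is bounded by 'k' (and unrelated to specifying 'k') follows from the map output key being a cluster id: there are at most k distinct keys, so any reducers beyond k sit idle. A toy sketch, assuming the standard hash-partitioning scheme; this is illustrative and not Mahout's actual mapper code.

```python
# Illustrative sketch: why at most k reducers can do useful work in the
# k-means MapReduce step. Not Mahout code; names are hypothetical.
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans_map(points, centers):
    """Map step: emit (nearest-cluster-id, point) pairs.

    The key space is the cluster ids 0..k-1, nothing more."""
    for p in points:
        cid = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        yield cid, p

def partition(key, num_reducers):
    """Default hash partitioning: route each key to one reducer."""
    return hash(key) % num_reducers
```

With k = 2 centers, only 2 distinct keys are ever emitted, so configuring, say, 5 reducers leaves at least 3 of them with no input.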