You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by "Jeff Eastman (JIRA)" <ji...@apache.org> on 2010/07/15 01:12:50 UTC

[jira] Issue Comment Edited: (MAHOUT-294) Uniform API behavior for Jobs

    [ https://issues.apache.org/jira/browse/MAHOUT-294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888561#action_12888561 ] 

Jeff Eastman edited comment on MAHOUT-294 at 7/14/10 7:12 PM:
--------------------------------------------------------------

I've seen the light and attached is a patch that goes most of the way towards being a good child of AbstractJob. It refactors the command parsing out of main() into run() and also refactors the guts out of static runJob() into non-static job() which does all the heavy lifting. Job() is also called from run() so the whole thing hangs together pretty well and none of the runJob clients were impacted.

On the DefaultOptionsCreator, I did some constant extraction and inlined any methods that were called only by one other driver. The remaining options are shared and i used addOption(DefaultOptionsCreator.blahOption().create()) rather than exploding them to use the expanded addOption(...). I found a precedent for this in CollocDriver and it seemed like a lot of busy work to refactor all the usages of DefaultOptionsCreator so I did not do that.

I also did not use AbstractJob.prepareJob() but left the individual conf and job initializations in place since they are working. There was also precedent for this. Clustering has several job steps (e.g. runClustering)  which do not use reducers and prepareJob doesn't work for them.

In terms of feedback on the AbstractJob design, I find the need to create two constants unwieldy as in:
{code}
  private static final String NUM_CLUSTERS_OPTION = "numClusters";
  public static final String NUM_CLUSTERS_OPTION_KEY = "--" + NUM_CLUSTERS_OPTION;
{code}

since the options argument map is keyed by "--" prepended to the option's long name. My preference would be to omit that but I see the code is using Option.preferredName() which is not subject to modification.

All the unit tests run but they don't really test the command line processing. The clustering tests use runJob() with java arguments instead. I admit to being a bit on the lazy side in terms of some possible refactoring but think this is pretty close to the target. 


      was (Author: jeastman):
    I've seen the light and attached is a patch that goes most of the way towards being a good child of AbstractJob. It refactors the command parsing out of main() into run() and also refactors the guts out of static runJob() into non-static job() which does all the heavy lifting. Job() is also called from run() so the whole thing hangs together pretty well and none of the runJob clients were impacted.

On the DefaultOptionsCreator, I did some constant extraction and inlined any methods that were called only by one other driver. The remaining options are shared and i used addOption(DefaultOptionsCreator.blahOption().create()) rather than exploding them to use the expanded addOption(...). I found a precedent for this in CollocDriver and it seemed like a lot of busy work to refactor all the usages of DefaultOptionsCreator so I did not do that.

I also did not use AbstractJob.prepareJob() but left the individual conf and job initializations in place since they are working. There was also precedent for this. Clustering has several job steps (e.g. runClustering)  which do not use reducers and prepareJob doesn't work for them.

In terms of feedback on the AbstractJob design, I find the need to create two constants unwieldy as in:
{code}
  private static final String NUM_CLUSTERS_OPTION = "numClusters";
  public static final Object NUM_CLUSTERS_OPTION_KEY = "--" + NUM_CLUSTERS_OPTION;
{code}

since the options argument map is keyed by "--" prepended to the option's long name. My preference would be to omit that but I see the code is using Option.preferredName() which is not subject to modification.

All the unit tests run but they don't really test the command line processing. The clustering tests use runJob() with java arguments instead. I admit to being a bit on the lazy side in terms of some possible refactoring but think this is pretty close to the target. 

  
> Uniform API behavior for Jobs
> -----------------------------
>
>                 Key: MAHOUT-294
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-294
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering, Collaborative Filtering, Frequent Itemset/Association Rule Mining, Genetic Algorithms, Math, Utils
>    Affects Versions: 0.4
>            Reporter: Robin Anil
>             Fix For: 0.4
>
>         Attachments: MAHOUT-294.patch, MAHOUT-294.patch
>
>
> * Move AbstractJob to common and convert all the Driver classes to extend that.
>    One suggestion is:
>    AlgorithmParams params = ParamsBuilder.build().withParam("-i", input).withParam("-o", output)....
>    MyAlgorithmn.runJob(params) throws ParameterMissingException;
> * Give uniform command-line parameters for various algorithms.
>    e.g Currently distance measure is -d, -dm, -m at different places in clustering
> * Add a temp directory as a parameter http://www.lucidimagination.com/search/document/28a979aa62c02a1/who_owns_mahout_bucket_on_s3#ddb5855e8bdace45
> This issue will keep track of all discussion/patches related to the design and cleanup of Mahout API

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.