Posted to dev@mahout.apache.org by "Frank Scholten (JIRA)" <ji...@apache.org> on 2011/02/20 16:22:38 UTC

[jira] Created: (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration
----------------------------------------------------------------------------------------------

                 Key: MAHOUT-612
                 URL: https://issues.apache.org/jira/browse/MAHOUT-612
             Project: Mahout
          Issue Type: Improvement
          Components: Clustering
    Affects Versions: 0.4
            Reporter: Frank Scholten


Most of the Mahout features require running several jobs in sequence. This can be done via the command line or using one of the driver classes.
Running and configuring a Mahout job from Java requires either calling the driver's static methods or building a String array of parameters and passing it to the job's main method. If we could instead configure jobs through a Java bean or factory, configuration would be type safe and easier to use with DI frameworks such as Spring and Guice.

I have added a patch where I factored a KMeans MapReduce job plus a configuration Java bean out of KMeansDriver.buildClustersMR(...).

* The KMeansMapReduceConfiguration takes care of setting up the correct values in the Hadoop Configuration object and initializes defaults. I copied the config keys from KMeansConfigKeys.
* The KMeansMapReduceJob contains the code for the actual algorithm: it runs all KMeans iterations and returns the KMeansMapReduceConfiguration, which contains the cluster path for the final iteration.

I'd like to extend this approach to other Hadoop jobs, for instance the job for creating points in KMeansDriver, but I first want some feedback on this.

One of the benefits of this approach is that it becomes easier to chain jobs. For instance, we can chain Canopy to KMeans by connecting the output dir of Canopy's configuration to the input dir of the configuration of the next KMeans job in the chain. Hadoop's JobControl class can then be used to connect and execute the entire chain.

This approach can be further improved by turning the configuration bean into a factory for creating MapReduce or sequential jobs. This would probably remove some duplicated code in the KMeansDriver.
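To make the idea concrete, here is a minimal, self-contained sketch of such a configuration bean. A plain string-keyed map stands in for Hadoop's Configuration so the example runs without Hadoop on the classpath, and the class name, setters, and "kmeans.*" keys are illustrative, not Mahout's actual ones:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of a type-safe configuration bean whose fields are
// written into a string-keyed configuration (a stand-in for Hadoop's
// org.apache.hadoop.conf.Configuration). All names and keys are invented.
public class KMeansConfigDemo {

    static class KMeansConfig {
        // Defaults are initialized here, the way the proposed
        // KMeansMapReduceConfiguration would initialize its defaults.
        private String inputPath = "/input";
        private String outputPath = "/output";
        private int maxIterations = 10;
        private double convergenceDelta = 0.5;

        void setInputPath(String p) { inputPath = p; }
        void setMaxIterations(int n) { maxIterations = n; }

        // Writes the typed fields under well-known string keys; only this
        // method deals with the untyped key/value world.
        Map<String, String> asConfiguration() {
            Map<String, String> conf = new LinkedHashMap<>();
            conf.put("kmeans.input", inputPath);
            conf.put("kmeans.output", outputPath);
            conf.put("kmeans.max.iterations", Integer.toString(maxIterations));
            conf.put("kmeans.convergence.delta", Double.toString(convergenceDelta));
            return conf;
        }
    }

    public static void main(String[] args) {
        KMeansConfig config = new KMeansConfig();
        config.setInputPath("/data/points"); // typed, checked at compile time
        config.setMaxIterations(20);

        Map<String, String> conf = config.asConfiguration();
        if (!"20".equals(conf.get("kmeans.max.iterations"))) {
            throw new AssertionError("unexpected iteration count");
        }
        System.out.println("config ok: " + conf.get("kmeans.input"));
    }
}
```

Callers set typed properties on the bean; only asConfiguration() touches string keys. That is where the type safety and DI friendliness come from: a Spring or Guice module can wire the bean like any other.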

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Updated: (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Frank Scholten (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank Scholten updated MAHOUT-612:
----------------------------------

    Attachment: MAHOUT-612-v2.patch

Updated and expanded the patch. Renamed KMeansMapReduceJob to KMeansMapReduceAlgorithm and added KMeansSequentialAlgorithm.

These implementations also create the points mapping by default, based on the runClustering flag.

The KMeansConfiguration can be used for both of these implementations.

> Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-612
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-612
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4
>            Reporter: Frank Scholten
>             Fix For: 0.5
>
>         Attachments: MAHOUT-612-v2.patch, MAHOUT-612.patch
>
>


[jira] Commented: (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Frank Scholten (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12998047#comment-12998047 ] 

Frank Scholten commented on MAHOUT-612:
---------------------------------------

Ok, I'll tackle the canopy jobs next.

What exactly do you mean by serializing and deserializing the configuration object at once in the job/mapper?






> Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-612
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-612
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4
>            Reporter: Frank Scholten
>             Fix For: 0.5
>
>         Attachments: MAHOUT-612-v2.patch, MAHOUT-612.patch
>
>


[jira] [Commented] (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015298#comment-13015298 ] 

Sean Owen commented on MAHOUT-612:
----------------------------------

May I start submitting the patches? Are the "v2" and "canopy" patches ready to go?
What's the thinking on whether this will be considered done by 0.5 in a few weeks, or should it be an ongoing piece of work for the next release?

> Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-612
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-612
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4
>            Reporter: Frank Scholten
>             Fix For: 0.5
>
>         Attachments: MAHOUT-612-canopy.patch, MAHOUT-612-v2.patch, MAHOUT-612.patch
>
>


[jira] Commented: (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Frank Scholten (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999450#comment-12999450 ] 

Frank Scholten commented on MAHOUT-612:
---------------------------------------

Robin: I think I understand what you mean about serializing the config. At the moment the mappers and reducers still need to access values in the Configuration object via the config keys. Is it possible to turn the (KMeans|Canopy)Configuration into a simple POJO, have it implement Writable, serialize it inside the Configuration, and deserialize it in the mapper and reducer? Or does this have performance implications or other consequences?

We could maybe add a method to (KMeans|Canopy)Configuration

{code:java}
public Configuration asConfiguration() { ... }
{code}

that serializes the bean inside a Configuration and returns it.
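To illustrate the idea under discussion, here is a hedged, self-contained sketch of serializing the whole bean into a single Configuration property and reading it back on the mapper side. Plain Java serialization plus Base64 stands in for a Writable, a string-keyed map stands in for Hadoop's Configuration, and the property key is made up:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

// Sketch of round-tripping a config bean through ONE serialized property,
// instead of one property per config key. Names and keys are illustrative.
public class SerializedConfigDemo {

    static class KMeansConfiguration implements Serializable {
        final String inputPath;
        final int numClusters;
        KMeansConfiguration(String inputPath, int numClusters) {
            this.inputPath = inputPath;
            this.numClusters = numClusters;
        }
    }

    // What asConfiguration() could do: serialize the bean into a single
    // entry of a string-keyed map (stand-in for Hadoop's Configuration).
    static Map<String, String> asConfiguration(KMeansConfiguration c) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(c);
        }
        Map<String, String> conf = new HashMap<>();
        conf.put("kmeans.configuration",
                 Base64.getEncoder().encodeToString(bytes.toByteArray()));
        return conf;
    }

    // What the mapper's setup() would do: read the property back into a bean.
    static KMeansConfiguration fromConfiguration(Map<String, String> conf)
            throws IOException, ClassNotFoundException {
        byte[] bytes = Base64.getDecoder().decode(conf.get("kmeans.configuration"));
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (KMeansConfiguration) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        KMeansConfiguration original = new KMeansConfiguration("/data/points", 5);
        KMeansConfiguration restored = fromConfiguration(asConfiguration(original));
        if (restored.numClusters != 5 || !"/data/points".equals(restored.inputPath)) {
            throw new AssertionError("round trip lost data");
        }
        System.out.println("round trip ok: k=" + restored.numClusters);
    }
}
```

In real Hadoop code one would more likely implement Writable and let the framework handle byte-level serialization; the cost is one extra (small) serialization per job submission plus one deserialization per task, which should be negligible next to the job itself.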

> Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-612
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-612
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4
>            Reporter: Frank Scholten
>             Fix For: 0.5
>
>         Attachments: MAHOUT-612-canopy.patch, MAHOUT-612-v2.patch, MAHOUT-612.patch
>
>


[jira] [Issue Comment Edited] (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Frank Scholten (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13010600#comment-13010600 ] 

Frank Scholten edited comment on MAHOUT-612 at 3/24/11 8:49 AM:
----------------------------------------------------------------

Yes, good idea. How about

interface SerializableConfiguration<T> {

  
  T getFromConfiguration(Configuration configuration);


  Configuration serializeInConfiguration(T t);


}
	
that will be implemented by KMeansConfiguration, CanopyConfiguration and configuration classes yet to be created.

The KMeansConfiguration equals is used in KMeansConfigurationTest via assertEquals.
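As a purely illustrative sketch, an implementation of that interface might look like the following; a string-keyed map again stands in for Hadoop's Configuration, and the class and key names are invented:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical implementation of the proposed SerializableConfiguration<T>
// interface. Map<String, String> stands in for Hadoop's Configuration so
// the sketch runs standalone; all key names are made up.
public class SerializableConfigurationDemo {

    interface SerializableConfiguration<T> {
        T getFromConfiguration(Map<String, String> configuration);
        Map<String, String> serializeInConfiguration(T t);
    }

    static class KMeansConfiguration {
        final String clustersPath;
        final int numIterations;
        KMeansConfiguration(String clustersPath, int numIterations) {
            this.clustersPath = clustersPath;
            this.numIterations = numIterations;
        }
        // equals() is what a KMeansConfigurationTest would use via assertEquals
        @Override public boolean equals(Object o) {
            if (!(o instanceof KMeansConfiguration)) return false;
            KMeansConfiguration other = (KMeansConfiguration) o;
            return numIterations == other.numIterations
                && clustersPath.equals(other.clustersPath);
        }
        @Override public int hashCode() {
            return clustersPath.hashCode() * 31 + numIterations;
        }
    }

    static class KMeansConfigurationSerializer
            implements SerializableConfiguration<KMeansConfiguration> {
        @Override
        public Map<String, String> serializeInConfiguration(KMeansConfiguration c) {
            Map<String, String> conf = new HashMap<>();
            conf.put("kmeans.clusters.path", c.clustersPath);
            conf.put("kmeans.num.iterations", Integer.toString(c.numIterations));
            return conf;
        }
        @Override
        public KMeansConfiguration getFromConfiguration(Map<String, String> conf) {
            return new KMeansConfiguration(
                conf.get("kmeans.clusters.path"),
                Integer.parseInt(conf.get("kmeans.num.iterations")));
        }
    }

    public static void main(String[] args) {
        KMeansConfiguration original = new KMeansConfiguration("/clusters-final", 10);
        KMeansConfigurationSerializer s = new KMeansConfigurationSerializer();
        KMeansConfiguration restored =
            s.getFromConfiguration(s.serializeInConfiguration(original));
        if (!original.equals(restored)) {
            throw new AssertionError("not equal after round trip");
        }
        System.out.println("configurations equal after round trip");
    }
}
```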

      was (Author: frankscholten):
    Yes, good idea. How about

interface SerializableConfiguration<T> {
  
  T getFromConfiguration(Configuration configuration);

  Configuration serializeInConfiguration(T t);

}
	
that will be implemented by KMeansConfiguration, CanopyConfiguration and configuration classes yet to be created.

The KMeansConfiguration equals is used in KMeansConfigurationTest via assertEquals.
  
> Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-612
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-612
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4
>            Reporter: Frank Scholten
>             Fix For: 0.5
>
>         Attachments: MAHOUT-612-canopy.patch, MAHOUT-612-v2.patch, MAHOUT-612.patch
>
>


[jira] [Updated] (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-612:
-----------------------------

    Affects Version/s: 0.5
             Assignee: Sean Owen

Frank, how far along are you here? It would be great to commit this once you've hit all the jobs you intend to.

> Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-612
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-612
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4, 0.5
>            Reporter: Frank Scholten
>            Assignee: Sean Owen
>         Attachments: MAHOUT-612-canopy.patch, MAHOUT-612-kmeans.patch, MAHOUT-612-v2.patch, MAHOUT-612.patch
>
>


[jira] [Issue Comment Edited] (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Frank Scholten (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13010600#comment-13010600 ] 

Frank Scholten edited comment on MAHOUT-612 at 3/24/11 8:53 AM:
----------------------------------------------------------------

Yes, good idea. How about

{code:java}
interface SerializableConfiguration<T> {
 
  T getFromConfiguration(Configuration configuration);

  Configuration serializeInConfiguration(T t);

}
{code}
	
that will be implemented by KMeansConfiguration, CanopyConfiguration and configuration classes yet to be created.

The KMeansConfiguration equals is used in KMeansConfigurationTest via assertEquals.

      was (Author: frankscholten):
    Yes, good idea. How about

interface SerializableConfiguration<T> {

  
  T getFromConfiguration(Configuration configuration);


  Configuration serializeInConfiguration(T t);


}
	
that will be implemented by KMeansConfiguration, CanopyConfiguration and configuration classes yet to be created.

The KMeansConfiguration equals is used in KMeansConfigurationTest via assertEquals.
  
> Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-612
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-612
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4
>            Reporter: Frank Scholten
>             Fix For: 0.5
>
>         Attachments: MAHOUT-612-canopy.patch, MAHOUT-612-v2.patch, MAHOUT-612.patch
>
>


[jira] [Updated] (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Sean Owen (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-612:
-----------------------------

    Fix Version/s:     (was: 0.6)
    
> Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-612
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-612
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering, Collaborative Filtering
>    Affects Versions: 0.4, 0.5
>            Reporter: Frank Scholten
>            Assignee: Sean Owen
>         Attachments: MAHOUT-612-canopy.patch, MAHOUT-612-kmeans.patch, MAHOUT-612-v2.patch, MAHOUT-612.patch
>
>


[jira] [Commented] (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Frank Scholten (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13040786#comment-13040786 ] 

Frank Scholten commented on MAHOUT-612:
---------------------------------------

Still at KMeans and Canopy. After Berlin Buzzwords I'll have time to continue with this issue.

> Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-612
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-612
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4, 0.5
>            Reporter: Frank Scholten
>            Assignee: Sean Owen
>         Attachments: MAHOUT-612-canopy.patch, MAHOUT-612-kmeans.patch, MAHOUT-612-v2.patch, MAHOUT-612.patch
>
>


[jira] [Commented] (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Ian Helmke (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062135#comment-13062135 ] 

Ian Helmke commented on MAHOUT-612:
-----------------------------------

Frank, are you still making changes here? Benson and I are looking to continue/complete the beanification of these jobs. Just wondering if you'd made any progress on it.

> Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-612
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-612
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4, 0.5
>            Reporter: Frank Scholten
>            Assignee: Sean Owen
>         Attachments: MAHOUT-612-canopy.patch, MAHOUT-612-kmeans.patch, MAHOUT-612-v2.patch, MAHOUT-612.patch
>
>


[jira] Commented: (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Isabel Drost (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000222#comment-13000222 ] 

Isabel Drost commented on MAHOUT-612:
-------------------------------------

Robin: Putting my Apache hat on - I know how easy GitHub makes collaboration; however, it would be nice to keep development inside our project, so until Apache supports git r/w access, I was wondering whether an svn branch would provide any benefit ...

> Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-612
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-612
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4
>            Reporter: Frank Scholten
>             Fix For: 0.5
>
>         Attachments: MAHOUT-612-canopy.patch, MAHOUT-612-v2.patch, MAHOUT-612.patch
>
>


[jira] Commented: (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Isabel Drost (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12998284#comment-12998284 ] 

Isabel Drost commented on MAHOUT-612:
-------------------------------------

Robin, I agree with your concern about avoiding an uneven state in the codebase. Given the anticipated amount of work that has to go into this, would it make sense to track these changes in a separate branch to avoid the "one huge patch that touches everything at once" problem?

> Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-612
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-612
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4
>            Reporter: Frank Scholten
>             Fix For: 0.5
>
>         Attachments: MAHOUT-612-v2.patch, MAHOUT-612.patch
>
>


[jira] [Commented] (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Grant Ingersoll (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13146246#comment-13146246 ] 

Grant Ingersoll commented on MAHOUT-612:
----------------------------------------

It seems like we shouldn't have to wait for the whole thing to be done on this. Forward progress towards where we want to go is better than no progress.
                
> Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-612
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-612
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering, Collaborative Filtering
>    Affects Versions: 0.4, 0.5
>            Reporter: Frank Scholten
>            Assignee: Sean Owen
>             Fix For: Backlog
>
>         Attachments: MAHOUT-612-canopy.patch, MAHOUT-612-kmeans.patch, MAHOUT-612-v2.patch, MAHOUT-612.patch
>
>


[jira] [Updated] (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Frank Scholten (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank Scholten updated MAHOUT-612:
----------------------------------

    Attachment: MAHOUT-612-kmeans.patch

Latest version of K-Means driver refactoring in sync with trunk

> Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-612
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-612
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4
>            Reporter: Frank Scholten
>         Attachments: MAHOUT-612-canopy.patch, MAHOUT-612-kmeans.patch, MAHOUT-612-v2.patch, MAHOUT-612.patch
>
>


[jira] [Commented] (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009775#comment-13009775 ] 

Robin Anil commented on MAHOUT-612:
-----------------------------------

Sorry about the late reply

The serialize and deserialize methods could be made stricter using an interface, and maybe renamed to make their purpose explicit.

Maybe an interface with serializeInConfiguration() and getFromConfiguration() methods, to keep things strictly uniform.

It looks good otherwise.

A small nit: autogenerated equals() and hashCode() are OK, but do you see them being used, e.g. as keys in HashMaps? You can choose to ignore them if you wish (or throw an exception). IMO a config object's primary purpose is to serialize and deserialize its members.



Robin


> Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-612
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-612
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4
>            Reporter: Frank Scholten
>             Fix For: 0.5
>
>         Attachments: MAHOUT-612-canopy.patch, MAHOUT-612-v2.patch, MAHOUT-612.patch
>
>


[jira] [Commented] (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Frank Scholten (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062223#comment-13062223 ] 

Frank Scholten commented on MAHOUT-612:
---------------------------------------

Cool that you're interested! I recently rebased my changes locally on a Mahout 0.5 branch and I'm making the Canopy configuration consistent with the KMeans configuration, wrt serialization and coding style. This is taking some time as I'm fixing a bunch of Canopy unit tests. I will push this to Github soon.

After this I think it's important that SparseVectorsFromSequenceFiles is refactored, since it's almost always needed for clustering jobs.


> Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-612
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-612
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4, 0.5
>            Reporter: Frank Scholten
>            Assignee: Sean Owen
>         Attachments: MAHOUT-612-canopy.patch, MAHOUT-612-kmeans.patch, MAHOUT-612-v2.patch, MAHOUT-612.patch
>
>


[jira] [Commented] (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Frank Scholten (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13010600#comment-13010600 ] 

Frank Scholten commented on MAHOUT-612:
---------------------------------------

Yes, good idea. How about

interface SerializableConfiguration<T> {

  T getFromConfiguration(Configuration configuration);

  Configuration serializeInConfiguration(T t);
}

that will be implemented by KMeansConfiguration, CanopyConfiguration, and configuration classes yet to be created.

KMeansConfiguration's equals is used in KMeansConfigurationTest via assertEquals.
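That round trip can be illustrated roughly as follows (class name, keys, and fields here are assumptions for the sketch, with a Map standing in for Hadoop's Configuration): equals() lets a test assert that serializing and then deserializing reproduces the original bean.

```java
import java.util.Map;
import java.util.Objects;

// Illustrative sketch only: a config bean whose equals()/hashCode()
// exist to support round-trip assertions in tests.
class CanopyConfig {
    private final double t1;
    private final double t2;

    CanopyConfig(double t1, double t2) { this.t1 = t1; this.t2 = t2; }

    // Serialize the typed fields into key/value pairs.
    Map<String, String> serialize() {
        return Map.of("canopy.t1", Double.toString(t1),
                      "canopy.t2", Double.toString(t2));
    }

    // Rebuild a bean from the serialized key/value pairs.
    static CanopyConfig deserialize(Map<String, String> conf) {
        return new CanopyConfig(Double.parseDouble(conf.get("canopy.t1")),
                                Double.parseDouble(conf.get("canopy.t2")));
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof CanopyConfig)) return false;
        CanopyConfig c = (CanopyConfig) o;
        return t1 == c.t1 && t2 == c.t2;
    }

    @Override public int hashCode() { return Objects.hash(t1, t2); }
}
```

A test can then simply assert that `deserialize(original.serialize())` equals `original`, which is the pattern the assertEquals usage above relies on.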

> Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-612
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-612
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4
>            Reporter: Frank Scholten
>             Fix For: 0.5
>
>         Attachments: MAHOUT-612-canopy.patch, MAHOUT-612-v2.patch, MAHOUT-612.patch
>
>


[jira] [Updated] (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-612:
-----------------------------

      Component/s: Collaborative Filtering
                   Classification
    Fix Version/s: 0.6

Looking good, marking for 0.6

> Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-612
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-612
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering, Collaborative Filtering
>    Affects Versions: 0.4, 0.5
>            Reporter: Frank Scholten
>            Assignee: Sean Owen
>             Fix For: 0.6
>
>         Attachments: MAHOUT-612-canopy.patch, MAHOUT-612-kmeans.patch, MAHOUT-612-v2.patch, MAHOUT-612.patch
>
>


[jira] [Commented] (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Frank Scholten (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064183#comment-13064183 ] 

Frank Scholten commented on MAHOUT-612:
---------------------------------------

Just pushed a new branch to Github, https://github.com/frankscholten/mahout/tree/MAHOUT-612-0.5, rebased at 0.5 with one commit of both KMeans and Canopy config. Next up is SparseVectorsFromSequenceFiles.

> Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-612
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-612
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4, 0.5
>            Reporter: Frank Scholten
>            Assignee: Sean Owen
>         Attachments: MAHOUT-612-canopy.patch, MAHOUT-612-kmeans.patch, MAHOUT-612-v2.patch, MAHOUT-612.patch
>
>


[jira] [Updated] (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Sean Owen (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-612:
-----------------------------

    Assignee:     (was: Sean Owen)
    
> Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-612
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-612
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering, Collaborative Filtering
>    Affects Versions: 0.4, 0.5
>            Reporter: Frank Scholten
>             Fix For: Backlog
>
>         Attachments: MAHOUT-612-canopy.patch, MAHOUT-612-kmeans.patch, MAHOUT-612-v2.patch, MAHOUT-612.patch
>
>
> Most of the Mahout features require running several jobs in sequence. This can be done via the command line or using one of the driver classes.
> Running and configuring a Mahout job from Java requires using either the driver's static methods or creating a String array of parameters and passing it to the job's main method. If we can instead configure jobs through a Java bean or factory, the configuration becomes type safe and easier to use from DI frameworks such as Spring and Guice.
> I have added a patch in which I factored a KMeans MapReduce job, plus a configuration Java bean, out of KMeansDriver.buildClustersMR(...)
> * The KMeansMapReduceConfiguration takes care of setting the correct values in the Hadoop Configuration object and initializes defaults. I copied the config keys from KMeansConfigKeys.
> * The KMeansMapReduceJob contains the code for the actual algorithm, running all iterations of KMeans, and returns the KMeansMapReduceConfiguration, which contains the cluster path of the final iteration.
> I would like to extend this approach to other Hadoop jobs, for instance the job for creating points in KMeansDriver, but I first want some feedback on this.
> One of the benefits of this approach is that it becomes easier to chain jobs. For instance, we can chain Canopy to KMeans by connecting the output dir of Canopy's configuration to the input dir of the configuration of the KMeans job next in the chain. Hadoop's JobControl class can then be used to connect and execute the entire chain.
> This approach can be further improved by turning the configuration bean into a factory for creating MapReduce or sequential jobs. This would probably remove some duplicated code in KMeansDriver.
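The chaining described in the quoted description can be sketched with plain beans; the names below are hypothetical, and in the real patch the directories would be Hadoop Path objects, with the jobs submitted through Hadoop's JobControl:

```java
// Illustrative sketch of chaining: the output dir of one job's
// configuration becomes the input dir of the next. Names are hypothetical.
class JobConfigBean {
    String inputDir;
    String outputDir;

    JobConfigBean(String inputDir, String outputDir) {
        this.inputDir = inputDir;
        this.outputDir = outputDir;
    }
}

class JobChainSketch {
    // Wire a downstream config so it reads where the upstream job wrote.
    static JobConfigBean chainAfter(JobConfigBean upstream, String outputDir) {
        return new JobConfigBean(upstream.outputDir, outputDir);
    }

    public static void main(String[] args) {
        JobConfigBean canopy = new JobConfigBean("/data/vectors", "/data/canopy-clusters");
        JobConfigBean kmeans = chainAfter(canopy, "/data/kmeans-clusters");
        System.out.println(kmeans.inputDir); // the Canopy output dir
    }
}
```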


[jira] Commented: (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12997618#comment-12997618 ] 

Sean Owen commented on MAHOUT-612:
----------------------------------

I think I understand your patch. You're leaving KMeansDriver as the shell with which to run it from the command line, but introducing one more layer of abstraction between it and running Hadoop MapReduces so that it can be invoked programmatically. Sounds fine to me.

My only bit of feedback then is about naming. We unfortunately have some conflicting naming here for the command-line class that runs MapReduces and implements Tool. It's a "*Job" in some places and "*Driver" in other places. (Anyone prefer one convention? I could JIRA that too.)

To avoid deepening the confusion, consider renaming KMeansMapReduceJob to something that doesn't end in either of those. :)
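The layering Sean describes can be sketched as follows; class and method names here are hypothetical, not the actual patch contents:

```java
// Sketch of the proposed layering: the command-line driver stays a thin
// shell that only translates String[] arguments into a config bean, while
// programmatic callers construct the bean and invoke the job directly.
class KMeansConfig {
    String inputPath;
    String outputPath;
}

class KMeansShell {
    // Programmatic entry point, callable from Java or a DI container.
    static String runKMeans(KMeansConfig config) {
        // ... set up and run the MapReduce iterations here ...
        return config.outputPath; // path of the final clusters
    }

    // Command-line shell: parses arguments, delegates to the entry point.
    public static void main(String[] args) {
        KMeansConfig config = new KMeansConfig();
        config.inputPath = args[0];
        config.outputPath = args[1];
        System.out.println(runKMeans(config));
    }
}
```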

> Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-612
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-612
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4
>            Reporter: Frank Scholten
>             Fix For: 0.5
>
>         Attachments: MAHOUT-612.patch
>
>


[jira] Commented: (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999721#comment-12999721 ] 

Robin Anil commented on MAHOUT-612:
-----------------------------------

Isabel: Yeah github is the easiest way to go.

> Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-612
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-612
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4
>            Reporter: Frank Scholten
>             Fix For: 0.5
>
>         Attachments: MAHOUT-612-canopy.patch, MAHOUT-612-v2.patch, MAHOUT-612.patch
>
>


[jira] Commented: (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12998038#comment-12998038 ] 

Robin Anil commented on MAHOUT-612:
-----------------------------------

Indeed! It removes a whole lot of parameters from the function. The code looks a lot nicer now. I would like to have the configuration object serialized and deserialized in one step in the job/mapper, or merged with the configuration object in some generic way, maybe via a base MahoutConfigBase class. All of these are nice-to-haves. If you can, I would really appreciate such a change.

But before committing I will wait for the full change to all jobs, so that the code is not left in an uneven state.
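Robin's serialize/deserialize-in-one-step suggestion might be sketched as below; a Map stands in for Hadoop's string-keyed Configuration, and the key and class names are hypothetical:

```java
import java.util.Map;

// Sketch of serializing a config bean into the job configuration in one
// place, and reading it back in one place (e.g. in the mapper's setup).
// A Map<String, String> stands in for org.apache.hadoop.conf.Configuration,
// which is likewise string-keyed. Key names are illustrative only.
class SerializableKMeansConfig {
    private static final String CLUSTER_PATH_KEY = "kmeans.cluster.path";
    private static final String CONVERGENCE_KEY = "kmeans.convergence.delta";

    String clusterPath;
    double convergenceDelta;

    // Serialize: one method pushes every field into the job conf.
    void writeTo(Map<String, String> conf) {
        conf.put(CLUSTER_PATH_KEY, clusterPath);
        conf.put(CONVERGENCE_KEY, Double.toString(convergenceDelta));
    }

    // Deserialize: the mapper rebuilds the whole bean from the job conf.
    static SerializableKMeansConfig readFrom(Map<String, String> conf) {
        SerializableKMeansConfig config = new SerializableKMeansConfig();
        config.clusterPath = conf.get(CLUSTER_PATH_KEY);
        config.convergenceDelta = Double.parseDouble(conf.get(CONVERGENCE_KEY));
        return config;
    }
}
```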





> Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-612
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-612
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4
>            Reporter: Frank Scholten
>             Fix For: 0.5
>
>         Attachments: MAHOUT-612-v2.patch, MAHOUT-612.patch
>
>


[jira] [Commented] (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Frank Scholten (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015324#comment-13015324 ] 

Frank Scholten commented on MAHOUT-612:
---------------------------------------

These patches are outdated. This will indeed be an ongoing piece of work
and won't be done by 0.5.

What's the thinking on how to include the work? Robin said: "But,
before committing I will wait for the full change to all Jobs so that
code is not in un-even state."

I would prefer to be able to submit a patch per job configuration: one
for K-means, one for Canopy, and so on. The code will be in an uneven
state, true; however, this will save a lot of effort merging changes
from trunk later on, considering how actively Mahout is being developed.

> Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-612
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-612
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4
>            Reporter: Frank Scholten
>             Fix For: 0.5
>
>         Attachments: MAHOUT-612-canopy.patch, MAHOUT-612-v2.patch, MAHOUT-612.patch
>
>


[jira] [Updated] (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Frank Scholten (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank Scholten updated MAHOUT-612:
----------------------------------

    Fix Version/s: Backlog

Putting this in the backlog for now. As much as I like the idea of this patch, improving the parts of the clustering code I work with regularly has a higher priority for me. So far the kmeans, canopy and seq2sparse jobs have been refactored to have a bean configuration. If you want to help with this, check out the GitHub repo at https://github.com/frankscholten/mahout/tree/MAHOUT-612-0.5
                
> Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-612
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-612
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering, Collaborative Filtering
>    Affects Versions: 0.4, 0.5
>            Reporter: Frank Scholten
>            Assignee: Sean Owen
>             Fix For: Backlog
>
>         Attachments: MAHOUT-612-canopy.patch, MAHOUT-612-kmeans.patch, MAHOUT-612-v2.patch, MAHOUT-612.patch
>
>


[jira] Updated: (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Frank Scholten (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank Scholten updated MAHOUT-612:
----------------------------------

    Fix Version/s: 0.5
           Status: Patch Available  (was: Open)

> Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-612
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-612
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4
>            Reporter: Frank Scholten
>             Fix For: 0.5
>
>         Attachments: MAHOUT-612.patch
>
>


[jira] Commented: (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12998025#comment-12998025 ] 

Sean Owen commented on MAHOUT-612:
----------------------------------

This looks like quite a positive change, at the macro and micro level. Robin any thoughts? I can commit in a short while otherwise.

> Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-612
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-612
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4
>            Reporter: Frank Scholten
>             Fix For: 0.5
>
>         Attachments: MAHOUT-612-v2.patch, MAHOUT-612.patch
>
>


[jira] Commented: (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Frank Scholten (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005679#comment-13005679 ] 

Frank Scholten commented on MAHOUT-612:
---------------------------------------

I added K-Means config serialization code at https://github.com/frankscholten/mahout/tree/MAHOUT-612. See the 'kmeans-serialization' tag.

Robin: Is this close to what you had in mind?

> Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-612
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-612
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4
>            Reporter: Frank Scholten
>             Fix For: 0.5
>
>         Attachments: MAHOUT-612-canopy.patch, MAHOUT-612-v2.patch, MAHOUT-612.patch
>
>


[jira] Commented: (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000360#comment-13000360 ] 

Ted Dunning commented on MAHOUT-612:
------------------------------------

Isabel,

I find that keeping large patches up to date with only an SVN branch is infeasible.  Thus, I opt to use git privately and use the SVN interface to push changes back to SVN when committing.

Once I am doing that, why not share my git repository so that others can comment on the work in progress?  I still will have to be careful about what code I incorporate, but that is the responsibility of a committer in any case.

Hopefully, this should become irrelevant soon since Apache is making rapid progress on supporting git.


> Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-612
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-612
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4
>            Reporter: Frank Scholten
>             Fix For: 0.5
>
>         Attachments: MAHOUT-612-canopy.patch, MAHOUT-612-v2.patch, MAHOUT-612.patch
>
>


[jira] [Updated] (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-612:
-----------------------------

    Fix Version/s:     (was: 0.5)

I think one big patch is preferable. It avoids the risk that the patch can't be completed for some reason. It should be about as much work, including merge conflicts. But I don't think it is a big deal if you'd like to do it piece by piece, either.



[jira] Commented: (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999720#comment-12999720 ] 

Robin Anil commented on MAHOUT-612:
-----------------------------------

See FPGrowthParameters. It does something similar. I do not think that will have any performance effect; we are talking about < 10KB of data here.




        

[jira] Commented: (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Frank Scholten (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000558#comment-13000558 ] 

Frank Scholten commented on MAHOUT-612:
---------------------------------------

I started a MAHOUT-612 branch at https://github.com/frankscholten/mahout/tree/MAHOUT-612 and added the K-Means v2 and Canopy patches.

Robin: Ok, I'll look into the serialization issue for K-Means and Canopy next.



        

[jira] Updated: (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Frank Scholten (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank Scholten updated MAHOUT-612:
----------------------------------

    Attachment: MAHOUT-612-canopy.patch

Added patch for the canopy jobs.

This time I also moved the logic of composing output paths (canopies and points) from the CanopyDriver into the configuration object.

The config keys have now been moved to CanopyConfiguration and CanopyConfigKeys is removed. Some keys are different because I renamed output to outputBasePath, to make it clear that the canopy and points outputs are relative paths under this outputBasePath.
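
[Editor's note] The path-composition idea above can be sketched as follows. This is a hypothetical illustration: the accessor names and the "canopies"/"clusteredPoints" sub-directory names are assumptions, not necessarily those in the attached patch.

```java
// Hypothetical sketch: the configuration bean owns a single outputBasePath
// and derives the canopy and points output paths from it, so callers no
// longer compose these paths in the CanopyDriver.
public class CanopyPathSketch {

    static class CanopyConfiguration {
        private final String outputBasePath;

        CanopyConfiguration(String outputBasePath) {
            this.outputBasePath = outputBasePath;
        }

        // Canopy centers live under the base path...
        String getCanopyOutputPath() {
            return outputBasePath + "/canopies";
        }

        // ...and so does the clustered-points output.
        String getPointsOutputPath() {
            return outputBasePath + "/clusteredPoints";
        }
    }

    public static void main(String[] args) {
        CanopyConfiguration conf = new CanopyConfiguration("/output/canopy");
        System.out.println(conf.getCanopyOutputPath()); // /output/canopy/canopies
        System.out.println(conf.getPointsOutputPath()); // /output/canopy/clusteredPoints
    }
}
```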



        

[jira] Issue Comment Edited: (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Frank Scholten (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12997996#comment-12997996 ] 

Frank Scholten edited comment on MAHOUT-612 at 2/22/11 7:43 PM:
----------------------------------------------------------------

Updated and expanded the patch. Renamed KMeansMapReduceJob to KMeansMapReduceAlgorithm and added KMeansSequentialAlgorithm.

These implementations also create the points mapping by default, based on the runClustering flag.

The KMeansConfiguration can be used for both of these implementations.
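
[Editor's note] The "one configuration, two implementations" idea above can be sketched like this. The interface and field names are hypothetical assumptions; the real KMeansMapReduceAlgorithm would submit Hadoop jobs per iteration and KMeansSequentialAlgorithm would iterate in-process.

```java
// Hypothetical sketch: a shared interface lets the same KMeansConfiguration
// bean drive either the MapReduce or the sequential implementation.
interface KMeansAlgorithm {
    String run(KMeansConfiguration conf);
}

class KMeansConfiguration {
    String inputPath;
    String outputPath;
    int maxIterations = 10;
    boolean runClustering = true; // also create the points mapping by default
}

class KMeansMapReduceAlgorithm implements KMeansAlgorithm {
    public String run(KMeansConfiguration conf) {
        // Stand-in for launching one MapReduce job per k-means iteration.
        return "mapreduce:" + conf.outputPath;
    }
}

class KMeansSequentialAlgorithm implements KMeansAlgorithm {
    public String run(KMeansConfiguration conf) {
        // Stand-in for iterating over the input vectors in a single JVM.
        return "sequential:" + conf.outputPath;
    }
}

public class KMeansAlgorithmSketch {
    public static void main(String[] args) {
        KMeansConfiguration conf = new KMeansConfiguration();
        conf.inputPath = "/data/vectors";
        conf.outputPath = "/data/clusters";

        // The same bean configures either implementation.
        for (KMeansAlgorithm algo : new KMeansAlgorithm[] {
                new KMeansMapReduceAlgorithm(), new KMeansSequentialAlgorithm()}) {
            System.out.println(algo.run(conf));
        }
    }
}
```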



        

[jira] [Commented] (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Frank Scholten (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066509#comment-13066509 ] 

Frank Scholten commented on MAHOUT-612:
---------------------------------------

Pushed seq2sparse configuration to Github http://bit.ly/rmWAf4



        

[jira] Updated: (MAHOUT-612) Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration

Posted by "Frank Scholten (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank Scholten updated MAHOUT-612:
----------------------------------

    Attachment: MAHOUT-612.patch

