Posted to dev@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2010/09/12 19:21:03 UTC

Re: Clustering on Elastic Map Reduce

moving to dev@

So, I can run KMeansDriver directly on EMR, but one of the things I want to do is actually run MahoutDriver on EMR.  The only sticking point is these lines:
<snip classname="MahoutDriver">
InputStream propsStream = Thread.currentThread()
    .getContextClassLoader()
    .getResourceAsStream("driver.classes.props");

mainClasses.load(propsStream);
</snip>

because the properties file is not on the classpath that EMR gets.

Anyone have suggestions on working around this?  

My first thought is to create a JOB jar that contains the properties file, but it occurred to me that there might be a way to extend the classpath instead.  Other thoughts:
1. Instead of requiring driver.classes.props, we could have an interface that each of those drivers implements and that reports its short name and description.  Then we just need to do some reflection at startup to find all implementers of the interface.
2. We create a "default.driver.classes.props" that is actually packaged into the JOB jar.  We first look for driver.classes.props, then for default.driver.classes.props, and if neither is found we throw an exception.

I guess my preference is #2, since it is the least code, still allows the existing functionality to work, and provides reasonable defaults w/o any setup.
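
A minimal sketch of the lookup order described in option 2, for illustration only. The class name DriverPropsLoader and the method loadFirstAvailable are hypothetical and are not taken from MahoutDriver or from any patch; the two resource names are the ones proposed above. The sketch relies on ClassLoader.getResourceAsStream returning null, rather than throwing, when a resource is missing.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

/** Hypothetical helper (not the actual MahoutDriver code). */
class DriverPropsLoader {

  /**
   * Tries each resource name in order on the given class loader and loads the
   * first one that exists. Throws FileNotFoundException if none resolves.
   */
  static Properties loadFirstAvailable(ClassLoader loader, String... resourceNames)
      throws IOException {
    for (String name : resourceNames) {
      // getResourceAsStream returns null when the resource is absent;
      // try-with-resources skips close() for a null resource.
      try (InputStream in = loader.getResourceAsStream(name)) {
        if (in != null) {
          Properties props = new Properties();
          props.load(in);
          return props;
        }
      }
    }
    throw new FileNotFoundException(
        "None of " + String.join(", ", resourceNames) + " found on the classpath");
  }

  public static void main(String[] args) throws IOException {
    ClassLoader loader = Thread.currentThread().getContextClassLoader();
    // User-supplied file first, packaged default second, exception otherwise.
    Properties mainClasses =
        loadFirstAvailable(loader, "driver.classes.props", "default.driver.classes.props");
    System.out.println("Loaded " + mainClasses.size() + " driver entries");
  }
}
```

With this shape, a user-supplied driver.classes.props on the classpath still wins, and the copy packaged in the JOB jar is only consulted when the first lookup comes back empty.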

Thoughts?

-Grant

On Sep 12, 2010, at 8:07 AM, Grant Ingersoll wrote:

> 
> On Sep 12, 2010, at 7:42 AM, Grant Ingersoll wrote:
> 
>> 
>> On Sep 11, 2010, at 10:11 PM, Drew Farris wrote:
>> 
>>> I will write up notes on the EMR wiki page.
> 
> https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce is updated to 0.4-SNAPSHOT.
> 
> -Grant
> 
> 


Re: Clustering on Elastic Map Reduce

Posted by Sean Owen <sr...@gmail.com>.
I probably misunderstood the original problem. I had assumed the issue
was in getting the right .jar out to the workers. If it's just getting
stuff to the driver, yeah packaging properties files in the .jar file
should work.

On Mon, Sep 13, 2010 at 1:01 AM, Jake Mannix <ja...@gmail.com> wrote:
> Hmm?  Why would the workers need the driver.classes.props file?  It's what
> determines what MR job to run - once you're on a worker node, you're done
> with it, aren't you?  Or am I not following what the issue is...

Re: Clustering on Elastic Map Reduce

Posted by Jake Mannix <ja...@gmail.com>.
Hmm?  Why would the workers need the driver.classes.props file?  It's what
determines what MR job to run - once you're on a worker node, you're done
with it, aren't you?  Or am I not following what the issue is...

  -jake

On Sun, Sep 12, 2010 at 4:40 PM, Sean Owen <sr...@gmail.com> wrote:

> From the props file? My understanding is that it doesn't survive to the
> worker but does to the driver. Not quite so?
>
> On Sep 13, 2010 12:37 AM, "Jake Mannix" <ja...@gmail.com> wrote:
>
> On Sun, Sep 12, 2010 at 12:45 PM, Sean Owen <sr...@gmail.com> wrote:
>
> > 2 isn't how it is 'supposed...
> But where would the Driver get the values to put into the Configuration
> object?
>
>  -jake
>

Re: Clustering on Elastic Map Reduce

Posted by Sean Owen <sr...@gmail.com>.
From the props file? My understanding is that it doesn't survive to the
worker but does to the driver. Not quite so?

On Sep 13, 2010 12:37 AM, "Jake Mannix" <ja...@gmail.com> wrote:

On Sun, Sep 12, 2010 at 12:45 PM, Sean Owen <sr...@gmail.com> wrote:

> 2 isn't how it is 'supposed...
But where would the Driver get the values to put into the Configuration
object?

 -jake

Re: Clustering on Elastic Map Reduce

Posted by Jake Mannix <ja...@gmail.com>.
On Sun, Sep 12, 2010 at 12:45 PM, Sean Owen <sr...@gmail.com> wrote:

> 2 isn't how it is 'supposed' to work. The Configuration object is how you
> pass any name-value pairs to the job.
>
> The more correct way is for the Driver to copy the properties entries into
> Configuration. Everything downstream can see that.
>
> I think we would do well to keep it simple here. There are already props
> files and two flavors of command line args in play for configuration.
>

But where would the Driver get the values to put into the Configuration
object?

  -jake

Re: Clustering on Elastic Map Reduce

Posted by Sean Owen <sr...@gmail.com>.
2 isn't how it is 'supposed' to work. The Configuration object is how you
pass any name-value pairs to the job.

The more correct way is for the Driver to copy the properties entries into
Configuration. Everything downstream can see that.
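
A rough sketch of the copying step described above, for the case where values do need to reach the workers. To stay self-contained it takes a generic setter instead of a Hadoop object; the class PropsToConf and the method copyInto are hypothetical names. With Hadoop on the classpath, the setter could be conf::set for an org.apache.hadoop.conf.Configuration named conf, since Configuration has a set(String, String) method.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.function.BiConsumer;

/** Hypothetical helper (not part of Mahout or Hadoop). */
class PropsToConf {

  /**
   * Passes every string-valued entry of props to the setter and returns the
   * number of entries copied.
   */
  static int copyInto(Properties props, BiConsumer<String, String> setter) {
    int copied = 0;
    // stringPropertyNames() also includes keys from any chained defaults.
    for (String key : props.stringPropertyNames()) {
      setter.accept(key, props.getProperty(key));
      copied++;
    }
    return copied;
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("example.key", "example value");

    // Stand-in for a job configuration; with Hadoop this would be conf::set.
    Map<String, String> sink = new HashMap<>();
    int copied = copyInto(props, sink::put);
    System.out.println("Copied " + copied + " entries: " + sink);
  }
}
```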

I think we would do well to keep it simple here. There are already props
files and two flavors of command line args in play for configuration.

Sean

On Sep 12, 2010 7:41 PM, "Ted Dunning" <te...@gmail.com> wrote:
> The reflection option sounds dangerous because it isn't clear that the
> classes will be loaded yet which would mean that they wouldn't be seen.
>
> Option 2 is, as you say, relatively simple.
>
> On Sun, Sep 12, 2010 at 10:21 AM, Grant Ingersoll <gsingers@apache.org> wrote:
>
>> My first thought is to create a JOB jar that contains the properties, but
>> the thought occurred to me that there might be a way to enhance the
>> classpath. Other thoughts:
>> 1. Instead of requiring driver.classes.props, we could just have an
>> Interface that each of those drivers implements that reports its short
>> name and description and then we just need to do some reflection at
>> startup to get all implementers of the interface.
>> 2. We create a "default.driver.classes.props" that is actually packaged
>> into the JOB jar. We first look for driver.classes.props then we look for
>> default.driver.classes.props, then we throw an exception.
>>
>> I guess my preference is #2, since that is the least code, still allows
>> the existing functionality to work and provides reasonable defaults w/o any
>> setup.
>>

Re: Clustering on Elastic Map Reduce

Posted by Grant Ingersoll <gs...@apache.org>.
On Sep 12, 2010, at 7:35 PM, Jake Mannix wrote:

> On Sun, Sep 12, 2010 at 12:23 PM, Grant Ingersoll <gs...@apache.org> wrote:
> 
>> 
>>> Option 2 is, as you say, relatively simple.
>> 
>> I have this working and will post/commit a patch.

https://issues.apache.org/jira/browse/MAHOUT-500 has the patch.  It's a pretty trivial change and I just use the existing driver.classes.props file (renaming it) so that we don't have to maintain two copies.

-Grant

Re: Clustering on Elastic Map Reduce

Posted by Jake Mannix <ja...@gmail.com>.
On Sun, Sep 12, 2010 at 12:23 PM, Grant Ingersoll <gs...@apache.org> wrote:

>
> > Option 2 is, as you say, relatively simple.
>
> I have this working and will post/commit a patch.


+1 on this - it was what I'd had in mind originally with the
driver.classes.props file.  In fact, it's the one and only .props file in the
conf directory which is "required", and it is only exposed to users so that
they can easily add their own driver classes for MahoutDriver to use by
editing this file.  Having a default set of values either in the .job file
or hardcoded into MahoutDriver would make sure that file isn't needed for
the general use case.

  -jake

Re: Clustering on Elastic Map Reduce

Posted by Grant Ingersoll <gs...@apache.org>.
On Sep 12, 2010, at 2:40 PM, Ted Dunning wrote:

> The reflection option sounds dangerous because it isn't clear that the
> classes will be loaded yet which would mean that they wouldn't be seen.

Agreed.

> 
> Option 2 is, as you say, relatively simple.

I have this working and will post/commit a patch.

> 
> On Sun, Sep 12, 2010 at 10:21 AM, Grant Ingersoll <gs...@apache.org> wrote:
> 
>> My first thought is to create a JOB jar that contains the properties, but
>> the thought occurred to me that there might be a way to enhance the
>> classpath.  Other thoughts:
>> 1. Instead of requiring driver.classes.props, we could just have an
>> Interface that each of those drivers implements that reports its short name
>> and description and then we just need to do some reflection at startup to
>> get all implementers of the interface.
>> 2. We create a "default.driver.classes.props" that is actually packaged
>> into the JOB jar.  We first look for driver.classes.props then we look for
>> default.driver.classes.props, then we throw an exception.
>> 
>> I guess my preference is #2, since that is the least code, still allows the
>> existing functionality to work and provides reasonable defaults w/o any
>> setup.
>> 



Re: Clustering on Elastic Map Reduce

Posted by Ted Dunning <te...@gmail.com>.
The reflection option sounds dangerous because it isn't clear that the
classes will be loaded yet which would mean that they wouldn't be seen.

Option 2 is, as you say, relatively simple.
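
The class-loading concern can be illustrated with a small sketch. It is not the reflection scan from option 1; it uses a hypothetical self-registering class (SelfRegistrationDemo, ExampleDriver and REGISTRY are invented names) to show that the JVM does not load and initialize a class until something first uses it, so anything that only inspects already-loaded classes would miss it.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical demo of lazy class initialization (not Mahout code). */
class SelfRegistrationDemo {

  /** Hypothetical registry that drivers add themselves to when initialized. */
  static final List<String> REGISTRY = new ArrayList<>();

  /** Hypothetical driver that registers its short name in a static initializer. */
  static class ExampleDriver {
    static {
      REGISTRY.add("example");
    }
  }

  /**
   * Returns {seenBeforeLoad, seenAfterLoad}. The class literal below does not
   * initialize ExampleDriver; Class.forName(String) does.
   */
  static boolean[] demo() throws ClassNotFoundException {
    boolean seenBefore = REGISTRY.contains("example");
    Class.forName(ExampleDriver.class.getName());
    boolean seenAfter = REGISTRY.contains("example");
    return new boolean[] {seenBefore, seenAfter};
  }

  public static void main(String[] args) throws ClassNotFoundException {
    boolean[] seen = demo();
    System.out.println("before load: " + seen[0] + ", after load: " + seen[1]);
  }
}
```

A properties file avoids this because it names the driver classes explicitly, so nothing depends on which classes happen to have been loaded already.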

On Sun, Sep 12, 2010 at 10:21 AM, Grant Ingersoll <gs...@apache.org> wrote:

> My first thought is to create a JOB jar that contains the properties, but
> the thought occurred to me that there might be a way to enhance the
> classpath.  Other thoughts:
> 1. Instead of requiring driver.classes.props, we could just have an
> Interface that each of those drivers implements that reports its short name
> and description and then we just need to do some reflection at startup to
> get all implementers of the interface.
> 2. We create a "default.driver.classes.props" that is actually packaged
> into the JOB jar.  We first look for driver.classes.props then we look for
> default.driver.classes.props, then we throw an exception.
>
> I guess my preference is #2, since that is the least code, still allows the
> existing functionality to work and provides reasonable defaults w/o any
> setup.
>