Posted to dev@mahout.apache.org by "Jeff Eastman (JIRA)" <ji...@apache.org> on 2010/06/13 20:08:13 UTC

[jira] Created: (MAHOUT-414) Usability: Mahout applications need a consistent API to allow users to specify desired map/reduce concurrency

Usability: Mahout applications need a consistent API to allow users to specify desired map/reduce concurrency
-------------------------------------------------------------------------------------------------------------

                 Key: MAHOUT-414
                 URL: https://issues.apache.org/jira/browse/MAHOUT-414
             Project: Mahout
          Issue Type: Bug
    Affects Versions: 0.3
            Reporter: Jeff Eastman
             Fix For: 0.4


If specifying the number of mappers and reducers is a common activity that users need to perform when running Mahout applications on Hadoop clusters, then we need a standard way of specifying them in our APIs without exposing the full set of Hadoop options, especially for our non-power-users. Some applications already support this, but others require Hadoop-level -D arguments to achieve reasonable out-of-the-box parallelism, even when running our examples. The usability defect is that some of our algorithms won't scale without this control, and we have no standard way to express it in our APIs. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-414) Usability: Mahout applications need a consistent API to allow users to specify desired map/reduce concurrency

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916490#action_12916490 ] 

Sean Owen commented on MAHOUT-414:
----------------------------------

Jeff, would you regard this as complete? I'm the only other person who commented here, and I'm happy to let you finish this out as you like.



[jira] Commented: (MAHOUT-414) Usability: Mahout applications need a consistent API to allow users to specify desired map/reduce concurrency

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914753#action_12914753 ] 

Hudson commented on MAHOUT-414:
-------------------------------

Integrated in Mahout-Quality #325 (See [https://hudson.apache.org/hudson/job/Mahout-Quality/325/])
    



Re: [jira] Commented: (MAHOUT-414) Usability: Mahout applications need a consistent API to allow users to specify desired map/reduce concurrency

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  NVM on -D. It works from the command line but not when running the 
Driver job's main() directly from Eclipse. The -D foo.bar.baz=11 syntax 
is correct as advertised.
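
A minimal sketch of the pattern that makes it work from Eclipse as well 
(FooDriver here is hypothetical, for illustration only): route main() 
through ToolRunner instead of calling run() directly.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class FooDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // By the time run() is called, GenericOptionsParser has already
    // copied any -D key=value pairs into getConf().
    System.out.println("foo.bar.baz = " + getConf().getInt("foo.bar.baz", -1));
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // Going through ToolRunner is what makes -Dfoo.bar.baz=11 work,
    // from the shell and from an Eclipse launch alike.
    ToolRunner.run(new Configuration(), new FooDriver(), args);
  }
}
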
> Now I'm trying to use a -D argument to set a configuration parameter 
> but the parser won't accept it. I've tried -D foo.bar.baz=11 and 
> -Dfoo.bar.baz=11 with no joy on either. What is the correct syntax?
>


Re: [jira] Commented: (MAHOUT-414) Usability: Mahout applications need a consistent API to allow users to specify desired map/reduce concurrency

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  I've converted all the clustering to this model and am about to 
commit. I added a Configuration argument to all the Java methods and 
removed numReducers. I also deprecated 
DefaultOptionsCreator.numReducersOption. I'm actually starting to like 
putting all the Hadoop arguments into the Configuration, because the 
Java methods now have only algorithm-specific arguments.
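
In sketch form, the resulting shape is roughly this (class name and 
signatures are illustrative, not the exact committed code):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public final class ExampleClusteringDriver {

  // Hadoop-level settings (e.g. mapred.reduce.tasks) ride in on conf;
  // the remaining parameters are algorithm-specific only.
  public static void job(Configuration conf, Path input, Path output)
      throws IOException, InterruptedException, ClassNotFoundException {
    Job job = new Job(conf, "example clustering job");
    FileInputFormat.addInputPath(job, input);
    FileOutputFormat.setOutputPath(job, output);
    // ... set mapper, reducer, and key/value classes here ...
    job.waitForCompletion(true);
  }

  // Convenience overload for Java callers with no special Hadoop needs.
  public static void job(Path input, Path output)
      throws IOException, InterruptedException, ClassNotFoundException {
    job(new Configuration(), input, output);
  }
}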

On 9/22/10 5:14 PM, Sean Owen wrote:
> You probably know more about this than I, but conceptually, all the
> params for a Hadoop job are in a Configuration. This includes
> command-line Hadoop args and anything else the code tosses in.
>
> One way or another the Job has to get such a Configuration. ToolRunner
> takes care of this in the command line case. A Java caller would have
> to also find a way to construct a Configuration.
>
> That says to me that something somewhere takes a Configuration as
> input in order to run the job. And that something can be called from a
> main() method plus ToolRunner for the command line case, or directly
> as a Java method.
>
> I think that's about the right idea in theory, don't know how it
> meshes with practice?
>
>
> No, the static method thing is something FindBugs and PMD flag, as well
> as IntelliJ. It's not always true that a method should be static just
> because it can be (for example, if you design for inheritance), but I
> had thought this was not the nature of a Driver.
>
> On Wed, Sep 22, 2010 at 7:32 PM, Jeff Eastman
> <jd...@windwardsolutions.com>  wrote:
>>   Exactly my point. Current clustering and a lot of other drivers don't call
>> ToolRunner in their main method; they do new Driver().run(). This needs to
>> be changed everywhere. The job() methods currently create new Configuration
>> objects since they are invoked mostly from Java in unit tests and layered
>> jobs (e.g. synthetic control). I've got a version of Canopy that does call
>> ToolRunner and it does return a populated Configuration from getConf() but,
>> since the job methods are now static, they can't call it; it needs to be an
>> explicit argument. So, I've added conf as the first parameter to job() (and
>> left a convenience version without it), and that seems to work.
>>
>> Now I'm trying to use a -D argument to set a configuration parameter but the
>> parser won't accept it. I've tried -D foo.bar.baz=11 and -Dfoo.bar.baz=11
>> with no joy on either. What is the correct syntax?
>>
>> On the separate question of explicit numReducers arguments to the Java
>> methods and the CLI I'm all for doing it consistently. It's more work for
>> Java callers to create and set the conf parameter than it is with an
>> explicit argument but most current callers would use the convenience method
>> anyway.
>>
>> On the static conversions themselves, new Foo().run() is how they used to do
>> it but, as you noted earlier, it should be ToolRunner.run(class, conf, args)
>> anyway. Since run() *is* an instance method it seemed more correct to have
>> the methods it called also be instance methods. In clustering, the methods
>> used to be static when I wrote them so I can't claim to be an OO purist,
>> though I still don't like them. Just trying to sort out the motivation for
>> the change: was this PMD, Checkstyle, or Seanstyle<g>?
>>
>> On 9/22/10 1:53 PM, Sean Owen wrote:
>>> Let me try
>>>
>>> On Wed, Sep 22, 2010 at 3:32 PM, Jeff Eastman
>>> <jd...@windwardsolutions.com>    wrote:
>>>>   The clustering drivers all call new Configuration() in their
>>>> implementations. When run only from the CLI, other Mahout jobs call
>>>> getConf() which is where the -D arguments get pulled in (right?). So
>>>> there
>>> This comes from using ToolRunner.run(). It sets up all those args, and
>>> then calls Tool.run(). So when you implement Tool, in run(), the
>>> result of getConf() has all that stuff.
>>>
>>> Inside, it's org.apache.hadoop.util.GenericOptionsParser that does that
>>> work.
>>>
>>> I think your point is that this doesn't hold up for the case of
>>> invoking from some arbitrary Java calling code. Yes, in that case, the
>>> caller might have to populate a Configuration object (or be able to
>>> modify it) to pass this sort of setting. At least that's how I'd play
>>> it.
>>>
>>> But then the question of adding a new command-line argument doesn't
>>> help this use case anyway.
>>>
>>> Am I following?
>>>
>>>
>>>> And what was the PMD/Checkstyle problem with instance methods on the
>>>> drivers
>>>> that motivated the regression to statics? I hate statics.
>>> The reasoning was simply that the methods used no instance methods or
>>> members. It was already "really" a static method.
>>>
>>> I have little problem with the hard-line OO approach that even such
>>> Driver classes ought to be full of instance methods anyway, and
>>> perhaps have this bit of glue to the non-object-oriented world at the
>>> end:
>>>
>>> public static void main(String[] args) {
>>>    new Foo().doIt();
>>> }
>>>
>>> ... but I guess I'm saying it did not seem to be written that way?
>>> Things were passed around as method args when they could otherwise be
>>> instance members. So it looked like the intent was a static method
>>> anyhow.
>>>


Re: [jira] Commented: (MAHOUT-414) Usability: Mahout applications need a consistent API to allow users to specify desired map/reduce concurrency

Posted by Sean Owen <sr...@gmail.com>.
You probably know more about this than I, but conceptually, all the
params for a Hadoop job are in a Configuration. This includes
command-line Hadoop args and anything else the code tosses in.

One way or another the Job has to get such a Configuration. ToolRunner
takes care of this in the command line case. A Java caller would have
to also find a way to construct a Configuration.

That says to me that something somewhere takes a Configuration as
input in order to run the job. And that something can be called from a
main() method plus ToolRunner for the command line case, or directly
as a Java method.

I think that's about the right idea in theory, don't know how it
meshes with practice?
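
In other words, something like this on the direct-Java side (a sketch, 
reusing the hypothetical ExampleClusteringDriver sketched above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class DirectJavaCaller {
  public static void main(String[] args) throws Exception {
    // The Java caller plays the role ToolRunner plays on the command
    // line: it constructs and populates the Configuration itself.
    Configuration conf = new Configuration();
    conf.set("mapred.reduce.tasks", "4"); // standard Hadoop property
    ExampleClusteringDriver.job(conf, new Path("in"), new Path("out"));
  }
}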


No, the static method thing is something FindBugs and PMD flag, as well
as IntelliJ. It's not always true that a method should be static just
because it can be (for example, if you design for inheritance), but I
had thought this was not the nature of a Driver.

On Wed, Sep 22, 2010 at 7:32 PM, Jeff Eastman
<jd...@windwardsolutions.com> wrote:
>  Exactly my point. Current clustering and a lot of other drivers don't call
> ToolRunner in their main method; they do new Driver().run(). This needs to
> be changed everywhere. The job() methods currently create new Configuration
> objects since they are invoked mostly from Java in unit tests and layered
> jobs (e.g. synthetic control). I've got a version of Canopy that does call
> ToolRunner and it does return a populated Configuration from getConf() but,
> since the job methods are now static, they can't call it; it needs to be an
> explicit argument. So, I've added conf as the first parameter to job() (and
> left a convenience version without it), and that seems to work.
>
> Now I'm trying to use a -D argument to set a configuration parameter but the
> parser won't accept it. I've tried -D foo.bar.baz=11 and -Dfoo.bar.baz=11
> with no joy on either. What is the correct syntax?
>
> On the separate question of explicit numReducers arguments to the Java
> methods and the CLI I'm all for doing it consistently. It's more work for
> Java callers to create and set the conf parameter than it is with an
> explicit argument but most current callers would use the convenience method
> anyway.
>
> On the static conversions themselves, new Foo().run() is how they used to do
> it but, as you noted earlier, it should be ToolRunner.run(class, conf, args)
> anyway. Since run() *is* an instance method it seemed more correct to have
> the methods it called also be instance methods. In clustering, the methods
> used to be static when I wrote them so I can't claim to be an OO purist,
> though I still don't like them. Just trying to sort out the motivation for
> the change: was this PMD, Checkstyle, or Seanstyle <g>?
>
> On 9/22/10 1:53 PM, Sean Owen wrote:
>>
>> Let me try
>>
>> On Wed, Sep 22, 2010 at 3:32 PM, Jeff Eastman
>> <jd...@windwardsolutions.com>  wrote:
>>>
>>>  The clustering drivers all call new Configuration() in their
>>> implementations. When run only from the CLI, other Mahout jobs call
>>> getConf() which is where the -D arguments get pulled in (right?). So
>>> there
>>
>> This comes from using ToolRunner.run(). It sets up all those args, and
>> then calls Tool.run(). So when you implement Tool, in run(), the
>> result of getConf() has all that stuff.
>>
>> Inside, it's org.apache.hadoop.util.GenericOptionsParser that does that
>> work.
>>
>> I think your point is that this doesn't hold up for the case of
>> invoking from some arbitrary Java calling code. Yes, in that case, the
>> caller might have to populate a Configuration object (or be able to
>> modify it) to pass this sort of setting. At least that's how I'd play
>> it.
>>
>> But then the question of adding a new command-line argument doesn't
>> help this use case anyway.
>>
>> Am I following?
>>
>>
>>> And what was the PMD/Checkstyle problem with instance methods on the
>>> drivers
>>> that motivated the regression to statics? I hate statics.
>>
>> The reasoning was simply that the methods used no instance methods or
>> members. It was already "really" a static method.
>>
>> I have little problem with the hard-line OO approach that even such
>> Driver classes ought to be full of instance methods anyway, and
>> perhaps have this bit of glue to the non-object-oriented world at the
>> end:
>>
>> public static void main(String[] args) {
>>   new Foo().doIt();
>> }
>>
>> ... but I guess I'm saying it did not seem to be written that way?
>> Things were passed around as method args when they could otherwise be
>> instance members. So it looked like the intent was a static method
>> anyhow.
>>
>

Re: [jira] Commented: (MAHOUT-414) Usability: Mahout applications need a consistent API to allow users to specify desired map/reduce concurrency

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  Exactly my point. Current clustering and a lot of other drivers don't 
call ToolRunner in their main method; they do new Driver().run(). This 
needs to be changed everywhere. The job() methods currently create new 
Configuration objects since they are invoked mostly from Java in unit 
tests and layered jobs (e.g. synthetic control). I've got a version of 
Canopy that does call ToolRunner and it does return a populated 
Configuration from getConf() but, since the job methods are now static, 
they can't call it; it needs to be an explicit argument. So, I've added 
conf as the first parameter to job() (and left a convenience version 
without it), and that seems to work.

Now I'm trying to use a -D argument to set a configuration parameter but 
the parser won't accept it. I've tried -D foo.bar.baz=11 and 
-Dfoo.bar.baz=11 with no joy on either. What is the correct syntax?

On the separate question of explicit numReducers arguments to the Java 
methods and the CLI I'm all for doing it consistently. It's more work 
for Java callers to create and set the conf parameter than it is with an 
explicit argument but most current callers would use the convenience 
method anyway.

On the static conversions themselves, new Foo().run() is how they used 
to do it but, as you noted earlier, it should be ToolRunner.run(class, 
conf, args) anyway. Since run() *is* an instance method it seemed more 
correct to have the methods it called also be instance methods. In 
clustering, the methods used to be static when I wrote them so I can't 
claim to be an OO purist, though I still don't like them. Just trying to 
sort out the motivation for the change: was this PMD, Checkstyle, or 
Seanstyle <g>?

On 9/22/10 1:53 PM, Sean Owen wrote:
> Let me try
>
> On Wed, Sep 22, 2010 at 3:32 PM, Jeff Eastman
> <jd...@windwardsolutions.com>  wrote:
>>   The clustering drivers all call new Configuration() in their
>> implementations. When run only from the CLI, other Mahout jobs call
>> getConf() which is where the -D arguments get pulled in (right?). So there
> This comes from using ToolRunner.run(). It sets up all those args, and
> then calls Tool.run(). So when you implement Tool, in run(), the
> result of getConf() has all that stuff.
>
> Inside, it's org.apache.hadoop.util.GenericOptionsParser that does that work.
>
> I think your point is that this doesn't hold up for the case of
> invoking from some arbitrary Java calling code. Yes, in that case, the
> caller might have to populate a Configuration object (or be able to
> modify it) to pass this sort of setting. At least that's how I'd play
> it.
>
> But then the question of adding a new command-line argument doesn't
> help this use case anyway.
>
> Am I following?
>
>
>> And what was the PMD/Checkstyle problem with instance methods on the drivers
>> that motivated the regression to statics? I hate statics.
> The reasoning was simply that the methods used no instance methods or
> members. It was already "really" a static method.
>
> I have little problem with the hard-line OO approach that even such
> Driver classes ought to be full of instance methods anyway, and
> perhaps have this bit of glue to the non-object-oriented world at the
> end:
>
> public static void main(String[] args) {
>    new Foo().doIt();
> }
>
> ... but I guess I'm saying it did not seem to be written that way?
> Things were passed around as method args when they could otherwise be
> instance members. So it looked like the intent was a static method
> anyhow.
>

Re: [jira] Commented: (MAHOUT-414) Usability: Mahout applications need a consistent API to allow users to specify desired map/reduce concurrency

Posted by Sean Owen <sr...@gmail.com>.
Let me try

On Wed, Sep 22, 2010 at 3:32 PM, Jeff Eastman
<jd...@windwardsolutions.com> wrote:
>  The clustering drivers all call new Configuration() in their
> implementations. When run only from the CLI, other Mahout jobs call
> getConf() which is where the -D arguments get pulled in (right?). So there

This comes from using ToolRunner.run(). It sets up all those args, and
then calls Tool.run(). So when you implement Tool, in run(), the
result of getConf() has all that stuff.

Inside, it's org.apache.hadoop.util.GenericOptionsParser that does that work.
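
In miniature, a sketch of that path (not the real internals verbatim):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.GenericOptionsParser;

public class GenericOptionsDemo {
  public static void main(String[] args) throws Exception {
    // The parser consumes generic options such as -Dfoo.bar.baz=11,
    // applies them to conf, and hands back whatever arguments remain.
    Configuration conf = new Configuration();
    String[] remaining = new GenericOptionsParser(conf, args).getRemainingArgs();
    System.out.println("foo.bar.baz = " + conf.get("foo.bar.baz"));
    System.out.println("remaining: " + remaining.length + " args");
  }
}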

I think your point is that this doesn't hold up for the case of
invoking from some arbitrary Java calling code. Yes, in that case, the
caller might have to populate a Configuration object (or be able to
modify it) to pass this sort of setting. At least that's how I'd play
it.

But then the question of adding a new command-line argument doesn't
help this use case anyway.

Am I following?


> And what was the PMD/Checkstyle problem with instance methods on the drivers
> that motivated the regression to statics? I hate statics.

The reasoning was simply that the methods used no instance methods or
members. It was already "really" a static method.

I have little problem with the hard-line OO approach that even such
Driver classes ought to be full of instance methods anyway, and
perhaps have this bit of glue to the non-object-oriented world at the
end:

public static void main(String[] args) {
  new Foo().doIt();
}

... but I guess I'm saying it did not seem to be written that way?
Things were passed around as method args when they could otherwise be
instance members. So it looked like the intent was a static method
anyhow.



>
> On 9/22/10 10:18 AM, Sean Owen wrote:
>>
>> Oh this smells like a solvable problem for sure.
>>
>> The Job eventually has a Configuration object; what exactly is the
>> flow where it doesn't? Surely that is fixable. That should run around
>> with the Job, and within that you can set whatever you like. Shouldn't
>> need more API changes.
>>
>> I don't see what the static-ness has to do with it then?
>>
>> On Wed, Sep 22, 2010 at 2:52 PM, Jeff Eastman
>> <jd...@windwardsolutions.com>  wrote:
>>>
>>>  What you say is true from the command line, but currently there is no
>>> way
>>> except via explicit arguments to control this from Java drivers. The
>>> run()
>>> commands get a Configuration from AbstractJob via getConf() but this
>>> returns
>>> null when calling from Java. I guess we could change the job/run methods to
>>> accept a configuration argument in place of the numReducers argument.
>>>
>>> The clustering drivers create a new configuration in those methods (not
>>> calling getConf()) right now, setting the job parameters from explicit
>>> arguments. I'll take a look at refactoring this and see if there is time
>>> to
>>> do it by end of next week. Probably is, if this is at the top of my list,
>>> but I will check.
>>>
>>> Actually, you changed all the clustering driver methods back to statics
>>> while fixing PMD/Checkstyle issues (r990892) and so getConf() cannot even
>>> be
>>> called from them!
>>>
>>> On 9/22/10 3:12 AM, Sean Owen (JIRA) wrote:
>>>>
>>>>     [
>>>>
>>>> https://issues.apache.org/jira/browse/MAHOUT-414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913434#action_12913434
>>>> ]
>>>>
>>>> Sean Owen commented on MAHOUT-414:
>>>> ----------------------------------
>>>>
>>>> I tend to think this is, in fact, a Hadoop-level configuration. At times
>>>> a
>>>> job may wish to force concurrency -- 1 job only when it knows there is
>>>> no
>>>> parallelism available, or 2x more reducers than mappers when that's
>>>> known to
>>>> be good.
>>>>
>>>> Users can control this already via Hadoop. Letting them control it via
>>>> duplicate command line parameters doesn't add that. I agree, it's
>>>> sometimes
>>>> hard to know how to set parallelism, though Hadoop's guesses are good.
>>>>
>>>> When I see Hadoop's guesses are too low, it's because input is too small
>>>> to create enough input shards. This is a different issue.
>>>>
>>>> So I guess I'm wondering what the concrete change here could be, for
>>>> discussion? since it's marked as 0.4.
>>>>

Re: [jira] Commented: (MAHOUT-414) Usability: Mahout applications need a consistent API to allow users to specify desired map/reduce concurrency

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  The clustering drivers all call new Configuration() in their 
implementations. When run only from the CLI, other Mahout jobs call 
getConf() which is where the -D arguments get pulled in (right?). So 
there is no way to set the Hadoop parameters when calling the static 
driver methods from Java programs. This is because getConf() cannot be 
called at all. Even with the instance versions of the methods, it would 
return null unless called from the CLI.

And what was the PMD/Checkstyle problem with instance methods on the 
drivers that motivated the regression to statics? I hate statics.

On 9/22/10 10:18 AM, Sean Owen wrote:
> Oh this smells like a solvable problem for sure.
>
> The Job eventually has a Configuration object; what exactly is the
> flow where it doesn't? Surely that is fixable. That should run around
> with the Job, and within that you can set whatever you like. Shouldn't
> need more API changes.
>
> I don't see what the static-ness has to do with it then?
>
> On Wed, Sep 22, 2010 at 2:52 PM, Jeff Eastman
> <jd...@windwardsolutions.com>  wrote:
>>   What you say is true from the command line, but currently there is no way
>> except via explicit arguments to control this from Java drivers. The run()
>> commands get a Configuration from AbstractJob via getConf() but this returns
>> null when calling from Java. I guess we could change the job/run methods to
>> accept a configuration argument in place of the numReducers argument.
>>
>> The clustering drivers create a new configuration in those methods (not
>> calling getConf()) right now, setting the job parameters from explicit
>> arguments. I'll take a look at refactoring this and see if there is time to
>> do it by end of next week. Probably is, if this is at the top of my list,
>> but I will check.
>>
>> Actually, you changed all the clustering driver methods back to statics
>> while fixing PMD/Checkstyle issues (r990892) and so getConf() cannot even be
>> called from them!
>>
>> On 9/22/10 3:12 AM, Sean Owen (JIRA) wrote:
>>>      [
>>> https://issues.apache.org/jira/browse/MAHOUT-414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913434#action_12913434
>>> ]
>>>
>>> Sean Owen commented on MAHOUT-414:
>>> ----------------------------------
>>>
>>> I tend to think this is, in fact, a Hadoop-level configuration. At times a
>>> job may wish to force concurrency -- 1 job only when it knows there is no
>>> parallelism available, or 2x more reducers than mappers when that's known to
>>> be good.
>>>
>>> Users can control this already via Hadoop. Letting them control it via
>>> duplicate command line parameters doesn't add that. I agree, it's sometimes
>>> hard to know how to set parallelism, though Hadoop's guesses are good.
>>>
>>> When I see Hadoop's guesses are too low, it's because input is too small
>>> to create enough input shards. This is a different issue.
>>>
>>> So I guess I'm wondering what the concrete change here could be, for
>>> discussion? since it's marked as 0.4.


Re: [jira] Commented: (MAHOUT-414) Usability: Mahout applications need a consistent API to allow users to specify desired map/reduce concurrency

Posted by Sean Owen <sr...@gmail.com>.
Oh this smells like a solvable problem for sure.

The Job eventually has a Configuration object; what exactly is the
flow where it doesn't? Surely that is fixable. That should run around
with the Job, and within that you can set whatever you like. Shouldn't
need more API changes.

I don't see what the static-ness has to do with it then?

On Wed, Sep 22, 2010 at 2:52 PM, Jeff Eastman
<jd...@windwardsolutions.com> wrote:
>  What you say is true from the command line, but currently there is no way
> except via explicit arguments to control this from Java drivers. The run()
> commands get a Configuration from AbstractJob via getConf() but this returns
> null when calling from Java. I guess we could change the job/run methods to
> accept a configuration argument in place of the numReducers argument.
>
> The clustering drivers create a new configuration in those methods (not
> calling getConf()) right now, setting the job parameters from explicit
> arguments. I'll take a look at refactoring this and see if there is time to
> do it by end of next week. Probably is, if this is at the top of my list,
> but I will check.
>
> Actually, you changed all the clustering driver methods back to statics
> while fixing PMD/Checkstyle issues (r990892) and so getConf() cannot even be
> called from them!
>
> On 9/22/10 3:12 AM, Sean Owen (JIRA) wrote:
>>
>>     [
>> https://issues.apache.org/jira/browse/MAHOUT-414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913434#action_12913434
>> ]
>>
>> Sean Owen commented on MAHOUT-414:
>> ----------------------------------
>>
>> I tend to think this is, in fact, a Hadoop-level configuration. At times a
>> job may wish to force concurrency -- 1 job only when it knows there is no
>> parallelism available, or 2x more reducers than mappers when that's known to
>> be good.
>>
>> Users can control this already via Hadoop. Letting them control it via
>> duplicate command line parameters doesn't add that. I agree, it's sometimes
>> hard to know how to set parallelism, though Hadoop's guesses are good.
>>
>> When I see Hadoop's guesses are too low, it's because input is too small
>> to create enough input shards. This is a different issue.
>>
>> So I guess I'm wondering what the concrete change here could be, for
>> discussion? since it's marked as 0.4.

Re: [jira] Commented: (MAHOUT-414) Usability: Mahout applications need a consistent API to allow users to specify desired map/reduce concurrency

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  What you say is true from the command line, but currently there is no 
way except via explicit arguments to control this from Java drivers. The 
run() commands get a Configuration from AbstractJob via getConf() but 
this returns null when calling from Java. I guess we could change the 
job/run methods to accept a configuration argument in place of the 
numReducers argument.

The clustering drivers create a new configuration in those methods (not 
calling getConf()) right now, setting the job parameters from explicit 
arguments. I'll take a look at refactoring this and see if there is time 
to do it by end of next week. Probably is, if this is at the top of my 
list, but I will check.

Actually, you changed all the clustering driver methods back to statics 
while fixing PMD/Checkstyle issues (r990892) and so getConf() cannot 
even be called from them!

On 9/22/10 3:12 AM, Sean Owen (JIRA) wrote:
>      [ https://issues.apache.org/jira/browse/MAHOUT-414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913434#action_12913434 ]
>
> Sean Owen commented on MAHOUT-414:
> ----------------------------------
>
> I tend to think this is, in fact, a Hadoop-level configuration. At times a job may wish to force concurrency -- 1 job only when it knows there is no parallelism available, or 2x more reducers than mappers when that's known to be good.
>
> Users can control this already via Hadoop. Letting them control it via duplicate command line parameters doesn't add that. I agree, it's sometimes hard to know how to set parallelism, though Hadoop's guesses are good.
>
> When I see Hadoop's guesses are too low, it's because input is too small to create enough input shards. This is a different issue.
>
> So I guess I'm wondering what the concrete change here could be, for discussion? since it's marked as 0.4.


[jira] Commented: (MAHOUT-414) Usability: Mahout applications need a consistent API to allow users to specify desired map/reduce concurrency

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913434#action_12913434 ] 

Sean Owen commented on MAHOUT-414:
----------------------------------

I tend to think this is, in fact, a Hadoop-level configuration. At times a job may wish to force concurrency -- 1 job only when it knows there is no parallelism available, or 2x more reducers than mappers when that's known to be good.

Users can control this already via Hadoop. Letting them control it via duplicate command line parameters doesn't add that. I agree, it's sometimes hard to know how to set parallelism, though Hadoop's guesses are good.

When I see Hadoop's guesses are too low, it's because input is too small to create enough input shards. This is a different issue.

So I guess I'm wondering what the concrete change here could be, for discussion, since it's marked for 0.4.



[jira] Commented: (MAHOUT-414) Usability: Mahout applications need a consistent API to allow users to specify desired map/reduce concurrency

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913902#action_12913902 ] 

Hudson commented on MAHOUT-414:
-------------------------------

Integrated in Mahout-Quality #314 (See [https://hudson.apache.org/hudson/job/Mahout-Quality/314/])
    MAHOUT-414: Added configuration arguments to clustering drivers and added
getConf() calls to pick up CLI arguments. Removed numReducers arguments and
deprecated DefaultOptionsCreator.numReducersOption. Adjusted main methods to use ToolRunner. Fixed unit tests. All tests run.




[jira] Resolved: (MAHOUT-414) Usability: Mahout applications need a consistent API to allow users to specify desired map/reduce concurrency

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman resolved MAHOUT-414.
---------------------------------

      Assignee: Jeff Eastman
    Resolution: Fixed

All the clustering applications now use AbstractJob, which supports the -D arguments for configuring Hadoop. All now call getConf() so that these parameters are handled correctly from the CLI, and the numReducers option has been removed. Marking as closed.
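
For illustration, that means a CLI user can now pass the standard Hadoop 
generic option ahead of a job's own options, e.g. 
bin/mahout kmeans -Dmapred.reduce.tasks=10 ... (the kmeans options 
themselves are unchanged here and shown only as an example; 
mapred.reduce.tasks is Hadoop's standard property), while a Java caller 
gets the same effect by setting that property on the Configuration it 
passes to the driver.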
