Posted to dev@mahout.apache.org by Drew Farris <dr...@gmail.com> on 2010/05/28 18:16:18 UTC

Re: --input now -Dmapred.input.dir ?

-user@m.a.org +dev@m.a.org

It might be nice to add a few default flags to AbstractJob that map directly
to -D arguments in hadoop; for example, I could see having -i map to
-Dmapred.input.dir, -o to -Dmapred.output.dir, -nr to -Dmapred.num.reducers,
etc. I think it is great to be able to accept arbitrary -D arguments, but it
would be nice to accept shorthand which also gets displayed in -h output.

The -D options don't get included in -h, and as a result it is unclear just
how to specify input or output to someone who might not be too familiar
with hadoop conventions.

From the API perspective, AbstractJob could provide no-arg methods like
AbstractJob.buildInputOption() etc, where the class using the AbstractJob
api need not be concerned with the precise letters, parameters, description
required for the option.
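
A rough sketch of what such a no-arg method might look like; to be clear,
buildInputOption() and buildOutputOption() are hypothetical names, not
anything AbstractJob provides today:

    // Hypothetical convenience methods: a standard --input/-i and --output/-o,
    // so callers need not repeat the letters and descriptions each time.
    protected static Option buildInputOption() {
      return buildOption("input", "i", "Path to job input directory", null);
    }

    protected static Option buildOutputOption() {
      return buildOption("output", "o", "Path to job output directory", null);
    }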

Tangentially related, I was wondering something about AbstractJob: With the
advent of the parsedArgs map returned by AbstractJob.parseArguments is there
a need to pass Option arguments around anymore? Could AbstractJob maintain
Options state in a sense?

For example, from RecommenderJob:

    Option numReccomendationsOpt = AbstractJob.buildOption("numRecommendations", "n",
        "Number of recommendations per user", "10");
    Option usersFileOpt = AbstractJob.buildOption("usersFile", "u",
        "File of users to recommend for", null);
    Option booleanDataOpt = AbstractJob.buildOption("booleanData", "b",
        "Treat input as without pref values", Boolean.FALSE.toString());

    Map<String,String> parsedArgs = AbstractJob.parseArguments(
        args, numReccomendationsOpt, usersFileOpt, booleanDataOpt);
    if (parsedArgs == null) {
      return -1;
    }

Could be changed to something like:

buildOption("numRecommendations", "n", "Number of recommendations per user",
"10");
buildOption("usersFile", "u", "File of users to recommend for", null);
buildOption("booleanData", "b", "Treat input as without pref values",
Boolean.FALSE.toString());
Map<String,String> parsedArgs = parseArguments();
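
Concretely, "maintaining Options state" might look something like the
following inside AbstractJob (an entirely hypothetical sketch; imports of
java.util.List/ArrayList assumed):

    // Hypothetical instance state: options registered before parsing.
    private final List<Option> options = new ArrayList<Option>();

    protected void addOption(String name, String shortName,
                             String description, String defaultValue) {
      options.add(buildOption(name, shortName, description, defaultValue));
    }

    protected Map<String,String> parseArguments(String[] args) {
      // Delegate to the existing static method with the accumulated options.
      Option[] opts = options.toArray(new Option[options.size()]);
      return parseArguments(args, opts);
    }

Stashing args in a field during run() would get this the rest of the way to
the no-arg parseArguments() shown above.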

Providing a set of input validators that check the input before launching a
job sounds like a pretty cool idea too.
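
As a purely illustrative sketch (none of these names exist in Mahout), such
a validator hook might look like:

    // Hypothetical pre-launch check. Assumed imports: java.util.Map,
    // java.io.IOException, org.apache.hadoop.conf.Configuration,
    // org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path.
    interface ArgumentValidator {
      /** @return an error message, or null if the arguments are acceptable */
      String validate(Map<String,String> parsedArgs, Configuration conf);
    }

    // Example: fail fast when no input path is set or it does not exist.
    ArgumentValidator inputExists = new ArgumentValidator() {
      public String validate(Map<String,String> parsedArgs, Configuration conf) {
        String dir = conf.get("mapred.input.dir");
        if (dir == null) {
          return "No input path set (use -Dmapred.input.dir)";
        }
        try {
          Path input = new Path(dir);
          return FileSystem.get(conf).exists(input)
              ? null : "Input path does not exist: " + input;
        } catch (IOException ioe) {
          return ioe.getMessage();
        }
      }
    };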

On Fri, May 28, 2010 at 10:55 AM, Sean Owen <sr...@gmail.com> wrote:

> Does it help to note this is Hadoop's flag? It seemed more standard,
> therefore possibly more intuitive for some already using Hadoop. We were
> starting to reinvent many flags this way, so it seemed better not to thunk
> them with no gain
>
> On May 28, 2010 6:06 AM, "Grant Ingersoll" <gs...@apache.org> wrote:
>
> I just saw that too, and it seems like a loss to me.  We did a lot of work
> to be consistent on this and have a lot of documentation out there that
> uses it.  -Dmapred.input.dir is so much less intuitive than -i or --input.
>
> -Grant
>
>
> On May 27, 2010, at 9:04 PM, Jake Mannix wrote:
>
> > Is that right? I think the mahout shell script ...
>

Re: bug in FileDataModel

Posted by Sean Owen <sr...@gmail.com>.
Could be missing something, but I don't follow. It is indeed loading
new data, but there may already be preferences from that user in the
new data at this point. These are the cases being handled here.

On Mon, Jun 28, 2010 at 9:23 PM, Tamas Jambor <ja...@gmail.com> wrote:
> One more thing I noticed: lines 457-484 could be removed entirely, since
> you are working with fresh data, so those situations don't really apply, I
> think.
>
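
For readers without the source handy, here is a minimal sketch of the case
Sean means: the same user appearing on more than one line of a freshly
loaded file, so the array already in the map still has to be grown. The
class names are Mahout's taste API; rawData, userID, itemID and value are
illustrative locals standing in for the surrounding parsing code.

    PreferenceArray existing = rawData.get(userID);
    if (existing != null) {
      // User already seen during this load: grow the array by one and append.
      PreferenceArray grown = new GenericUserPreferenceArray(existing.length() + 1);
      for (int i = 0; i < existing.length(); i++) {
        grown.set(i, existing.get(i));
      }
      grown.set(existing.length(), new GenericPreference(userID, itemID, value));
      rawData.put(userID, grown);
    }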

Re: bug in FileDataModel

Posted by Tamas Jambor <ja...@gmail.com>.
One more thing I noticed: lines 457-484 could be removed entirely, since you
are working with fresh data, so those situations don't really apply, I think.

On Mon, Jun 28, 2010 at 8:23 PM, Sean Owen <sr...@gmail.com> wrote:

> Yah good one, that looks wrong. I will fix that now (and I think a
> similar problem later in the file.)
>
> On Sun, Jun 27, 2010 at 11:41 PM, Tamas Jambor <ja...@gmail.com> wrote:
> > hi Sean,
> >
> > I might have spotted a bug in FileDataModel, in line 444, you create a
> > PreferenceArray newPrefs and then shift the existing values one step
> down,
> > but in the end this updated object never gets written back to the main
> data
> > object.
> >
> > Tamas
> >
>

Re: bug in FileDataModel

Posted by Sean Owen <sr...@gmail.com>.
Yah good one, that looks wrong. I will fix that now (and I think a
similar problem later in the file.)

On Sun, Jun 27, 2010 at 11:41 PM, Tamas Jambor <ja...@gmail.com> wrote:
> hi Sean,
>
> I might have spotted a bug in FileDataModel, in line 444, you create a
> PreferenceArray newPrefs and then shift the existing values one step down,
> but in the end this updated object never gets written back to the main data
> object.
>
> Tamas
>

bug in FileDataModel

Posted by Tamas Jambor <ja...@gmail.com>.
Hi Sean,

I might have spotted a bug in FileDataModel: at line 444 you create a
PreferenceArray newPrefs and then shift the existing values one step
down, but in the end this updated object never gets written back to the
main data object.

Tamas
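
For readers following along without the source, a minimal sketch of the
pattern Tamas describes (class names are Mahout's taste API; the surrounding
field names are illustrative, not the exact FileDataModel code):

    PreferenceArray prefs = rawData.get(userID);
    // Build a copy one element shorter, skipping the removed item.
    PreferenceArray newPrefs = new GenericUserPreferenceArray(prefs.length() - 1);
    for (int i = 0, j = 0; i < prefs.length(); i++) {
      if (prefs.getItemID(i) != itemID) {
        newPrefs.set(j++, prefs.get(i));
      }
    }
    rawData.put(userID, newPrefs); // the reported bug: this store was missing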

Re: --input now -Dmapred.input.dir ?

Posted by Grant Ingersoll <gs...@apache.org>.
+1

On May 28, 2010, at 2:26 PM, Ted Dunning wrote:

> I think that we need to handle both approaches for the traditional -D stuff
> from hadoop.  That way the help output will suggest a way for the naive user
> to succeed and hadoop users who assume that they can use hadoop -D options
> will also succeed.
> 
> On Fri, May 28, 2010 at 11:19 AM, Grant Ingersoll <gs...@apache.org> wrote:
> 
>> So, in some parts, I need to have Hadoop options configured (presumably
>> either in a Conf file or via -D) while other inputs
>> I'm going to put in with the traditional -- stuff.
>> 



Re: --input now -Dmapred.input.dir ?

Posted by Sean Owen <sr...@gmail.com>.
Agreed, we need to get used to the idea of passing both types of args.
For example, Hadoop args let you control the number of
mappers/reducers, and users will have to control that, and maybe ten
other options I could rattle off.

We shouldn't, and can't, disable the arguments that Hadoop users would
be used to.

We don't want to duplicate all of them with custom params, but for
really common args (like the required ones) I see the utility. So sure,
add back -i etc.
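
A sketch of the thunking this implies, assuming parseArguments stores values
under the long option names (that key format is an assumption here):

    // Copy the shorthand flags into the Hadoop Configuration so the rest of
    // the job sees the standard properties, exactly as if -D had been used.
    Configuration conf = getConf();
    String input = parsedArgs.get("--input");
    if (input != null) {
      conf.set("mapred.input.dir", input);   // same effect as -Dmapred.input.dir=...
    }
    String output = parsedArgs.get("--output");
    if (output != null) {
      conf.set("mapred.output.dir", output); // same effect as -Dmapred.output.dir=...
    }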

On Fri, May 28, 2010 at 2:26 PM, Ted Dunning <te...@gmail.com> wrote:
> I think that we need to handle both approaches for the traditional -D stuff
> from hadoop.  That way the help output will suggest a way for the naive user
> to succeed and hadoop users who assume that they can use hadoop -D options
> will also succeed.

Re: --input now -Dmapred.input.dir ?

Posted by Ted Dunning <te...@gmail.com>.
I think that we need to handle both approaches for the traditional -D stuff
from hadoop.  That way the help output will suggest a way for the naive user
to succeed and hadoop users who assume that they can use hadoop -D options
will also succeed.

On Fri, May 28, 2010 at 11:19 AM, Grant Ingersoll <gs...@apache.org> wrote:

> So, in some parts, I need to have Hadoop options configured (presumably
> either in a Conf file or via -D) while other inputs
> I'm going to put in with the traditional -- stuff.
>
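
Reading either form could be as simple as the following sketch (the
precedence choice and the "--input" key format are illustrative, not
settled):

    // -D options are already absorbed into getConf() by ToolRunner's
    // GenericOptionsParser; if --input was also given, let the flag win.
    String input = parsedArgs.get("--input");
    Path inputPath = (input != null)
        ? new Path(input)
        : new Path(getConf().get("mapred.input.dir"));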

Re: --input now -Dmapred.input.dir ?

Posted by Grant Ingersoll <gs...@apache.org>.
This seems really confusing to me:

<snip from="RecommenderJob">
    Option numReccomendationsOpt = AbstractJob.buildOption("numRecommendations", "n",
        "Number of recommendations per user", "10");
    Option usersFileOpt = AbstractJob.buildOption("usersFile", "u",
        "File of users to recommend for", null);
    Option booleanDataOpt = AbstractJob.buildOption("booleanData", "b",
        "Treat input as without pref values", Boolean.FALSE.toString());

    Map<String,String> parsedArgs = AbstractJob.parseArguments(
        args, numReccomendationsOpt, usersFileOpt, booleanDataOpt);
    if (parsedArgs == null) {
      return -1;
    }

    Configuration originalConf = getConf();
    Path inputPath = new Path(originalConf.get("mapred.input.dir"));
    Path outputPath = new Path(originalConf.get("mapred.output.dir"));
</snip>

So, in some parts, I need to have Hadoop options configured (presumably either in a Conf file or via -D) while other inputs
I'm going to put in with the traditional -- stuff.    

-Grant


On May 28, 2010, at 2:05 PM, Sean Owen wrote:

> I'm for all of those ideas. Would be great if someone else makes changes to
> make it more broadly usable, since so far it's just structure I have chucked
> in.
> 
> On May 28, 2010 12:16 PM, "Drew Farris" <dr...@gmail.com> wrote:
> 
> -user@m.a.org +dev@m.a.org
> 
> It might be nice to add a few default flags to AbstractJob that map directly ...

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search


Re: --input now -Dmapred.input.dir ?

Posted by Sean Owen <sr...@gmail.com>.
I'm for all of those ideas. Would be great if someone else makes changes to
make it more broadly usable, since so far it's just structure I have chucked
in.

On May 28, 2010 12:16 PM, "Drew Farris" <dr...@gmail.com> wrote:

-user@m.a.org +dev@m.a.org

It might be nice to add a few default flags to AbstractJob that map directly ...

Re: --input now -Dmapred.input.dir ?

Posted by Grant Ingersoll <gs...@apache.org>.
On May 28, 2010, at 12:16 PM, Drew Farris wrote:

> -user@m.a.org +dev@m.a.org
> 
> It might be nice to add a few default flags to AbstractJob that map directly
> to -D arguments in hadoop, for example, I could see having -i map to
> -Dmapred.input.dir, -o to -Dmapred.output.dir, -nr to -Dmapred.num.reducers,
> etc. I think it is great to be able to accept arbitrary -D arguments but it
> would be nice to accept shorthand which also gets displayed in -h output.
> 

+1.  Think of the users...  Plus, we have a lot of docs already that use this.

> The -D options don't get included in -h and as a result it is unclear just
> how to specify input or output to someone who might not be too familiar
> with hadoop conventions.

Besides, the Hadoop conventions are cumbersome.  Just b/c they do something in a non-obvious way doesn't mean we need to.

To some extent, as Hadoop gets easier to use, there is no reason why anyone need even know we are using Hadoop.  I don't think
we should tie our public interfaces (and the CLI is our primary public interface) to Hadoop.

> 
> From the API perspective, AbstractJob could provide no-arg methods like
> AbstractJob.buildInputOption() etc, where the class using the AbstractJob
> api need not be concerned with the precise letters, parameters, description
> required for the option.
> 
> Tangentially related, I was wondering something about AbstractJob: With the
> advent of the parsedArgs map returned by AbstractJob.parseArguments is there
> a need to pass Option arguments around anymore? Could AbstractJob maintain
> Options state in a sense?
> 
> For example, from RecommenderJob:
> 
>    Option numReccomendationsOpt =
> AbstractJob.buildOption("numRecommendations", "n",
>      "Number of recommendations per user", "10");
>    Option usersFileOpt = AbstractJob.buildOption("usersFile", "u",
>      "File of users to recommend for", null);
>    Option booleanDataOpt = AbstractJob.buildOption("booleanData", "b",
>      "Treat input as without pref values", Boolean.FALSE.toString());
> 
>    Map<String,String> parsedArgs = AbstractJob.parseArguments(
>        args, numReccomendationsOpt, usersFileOpt, booleanDataOpt);
>    if (parsedArgs == null) {
>      return -1;
>    }
> 
> Could be changed to something like:
> 
> buildOption("numRecommendations", "n", "Number of recommendations per user",
> "10");
> buildOption("usersFile", "u", "File of users to recommend for", null);
> buildOption("booleanData", "b", "Treat input as without pref values",
> Boolean.FALSE.toString());
> Map<String,String> parsedArgs = parseArguments();
> 
> Providing a set of input validators that check the input before launching a
> job sounds like a pretty cool idea too.

Seems nice to me.

> 
> On Fri, May 28, 2010 at 10:55 AM, Sean Owen <sr...@gmail.com> wrote:
> 
>> Does it help to note this is Hadoop's flag? It seemed more standard,
>> therefore possibly more intuitive for some already using Hadoop. We were
>> starting to reinvent many flags this way, so it seemed better not to thunk
>> them with no gain
>> 
>> On May 28, 2010 6:06 AM, "Grant Ingersoll" <gs...@apache.org> wrote:
>> 
>> I just saw that too, and it seems like a loss to me.  We did a lot of work
>> to be consistent on this and have a lot of documentation out there that
>> uses it.  -Dmapred.input.dir is so much less intuitive than -i or --input.
>> 
>> -Grant
>> 
>> 
>> On May 27, 2010, at 9:04 PM, Jake Mannix wrote:
>> 
>>> Is that right? I think the mahout shell script ...
>>