Posted to dev@mahout.apache.org by Jeff Eastman <jd...@windwardsolutions.com> on 2012/02/11 20:01:36 UTC

Goals for Mahout 0.7

Now that 0.6 is in the box, it seems a good time to start thinking about 
0.7, from a high level goal perspective at least. Here are a couple that 
come to mind:

  * Target code freeze date August 1, 2012
  * Get Jenkins working for us again
  * Complete clustering refactoring and classification convergence
  * ...


Fwd: Re: Goals for Mahout 0.7

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
+user@

I'd like our users involved in this discussion too.

-------- Original Message --------
Subject: 	Re: Goals for Mahout 0.7
Date: 	Sat, 11 Feb 2012 22:29:02 +0100
From: 	Frank Scholten <fr...@frankscholten.nl>
Reply-To: 	dev@mahout.apache.org
To: 	dev@mahout.apache.org



I'd like to add solving ClassNotFoundException problems with third
party jars in some jobs.

I experimented with having seq2sparse upload a third-party jar containing an
analyzer and add it to the DistributedCache. Uploading works, but I haven't
yet gotten it working inside the Mappers. I have some code lying around for
this that can be used as a starting point, including a separate project that
has dependencies on Mahout and on an analyzer to test things out.
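
The ClassNotFoundException typically surfaces where the mapper instantiates
the analyzer by class name via reflection. A minimal sketch of that pattern
(the loader class and error messages here are illustrative, not Mahout's
actual seq2sparse code):

```java
// Illustrative sketch: instantiate an analyzer by class name the way a
// mapper might. This fails with ClassNotFoundException when the third-party
// jar is missing from the task classpath (hence the DistributedCache idea).
public class AnalyzerLoader {

    public static Object loadAnalyzer(String className) {
        try {
            // The mapper only knows the class by name, from the job config.
            return Class.forName(className).getConstructor().newInstance();
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Analyzer class " + className
                + " not found; is its jar on the task classpath"
                + " (e.g. shipped via the DistributedCache)?", e);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(
                "Could not instantiate analyzer " + className, e);
        }
    }
}
```

Shipping the jar via the DistributedCache only helps if it actually lands on
the classpath the task JVM uses for this lookup.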

Another thing would be adding to or improving the integration tools, for
example adding a mysql2seq tool to cluster text from a SQL database.
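
The core of such a mysql2seq tool would just map (id, text) rows onto keyed
documents, analogous to the keys seqdirectory writes. A rough, hypothetical
sketch of that mapping, with the JDBC cursor and the Hadoop
SequenceFile.Writer plumbing deliberately left out:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical core of a mysql2seq tool: turn (id, text) rows from a SQL
// query result into keyed documents. Rows stand in for a JDBC ResultSet,
// and the returned map stands in for the SequenceFile records to write.
public class Rows2Seq {

    public static Map<String, String> toKeyedDocs(String[][] rows) {
        Map<String, String> docs = new LinkedHashMap<String, String>();
        for (String[] row : rows) {
            String id = row[0];
            String text = row[1];
            docs.put("/" + id, text);  // one key/value pair per document
        }
        return docs;
    }
}
```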

On Sat, Feb 11, 2012 at 8:01 PM, Jeff Eastman
<jd...@windwardsolutions.com>  wrote:
>  Now that 0.6 is in the box, it seems a good time to start thinking about
>  0.7, from a high level goal perspective at least. Here are a couple that
>  come to mind:
>
>  Target code freeze date August 1, 2012
>  Get Jenkins working for us again
>  Complete clustering refactoring and classification convergence

What kind of clustering refactoring do you mean here? I did some work on
creating bean configurations in the past (MAHOUT-612). I
underestimated the amount of work required to do the entire
refactoring. If this can be contributed and committed on a per-job
basis I would like to help out.

>  ...




Re: Goals for Mahout 0.7

Posted by Ted Dunning <te...@gmail.com>.
No problem.

And thank you for being kind when I used language less moderate than
appropriate.

On Thu, Feb 23, 2012 at 8:13 PM, Ioan Eugen Stan <st...@gmail.com>wrote:

> 2012/2/23 Ted Dunning <te...@gmail.com>:
> > Is this a joke?
> >
> >    new String[] {"-t", INPUT_TABLE, "-m", MAIL_ACCOUNT_ID}
> >
> > seems better than farting around with lists.
>
> True, thank you.
>
> --
> Ioan Eugen Stan
> http://ieugen.blogspot.com/
>

Re: Goals for Mahout 0.7

Posted by Ioan Eugen Stan <st...@gmail.com>.
2012/2/23 Ted Dunning <te...@gmail.com>:
> Is this a joke?
>
>    new String[] {"-t", INPUT_TABLE, "-m", MAIL_ACCOUNT_ID}
>
> seems better than farting around with lists.

True, thank you.

-- 
Ioan Eugen Stan
http://ieugen.blogspot.com/

Re: Goals for Mahout 0.7

Posted by Ted Dunning <te...@gmail.com>.
Is this a joke?

    new String[] {"-t", INPUT_TABLE, "-m", MAIL_ACCOUNT_ID}

seems better than farting around with lists.
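
For comparison, the two constructions side by side (the constant values here
are stand-ins; any strings would do):

```java
import java.util.LinkedList;
import java.util.List;

// The list-building style versus the inline array initializer; both
// produce the same String[].
public class ArgStyles {
    static final String INPUT_TABLE = "mail";
    static final String MAIL_ACCOUNT_ID = "42";

    static String[] viaList() {
        List<String> argList = new LinkedList<String>();
        argList.add("-t");
        argList.add(INPUT_TABLE);
        argList.add("-m");
        argList.add(MAIL_ACCOUNT_ID);
        return argList.toArray(new String[argList.size()]);
    }

    static String[] inline() {
        // One expression, no intermediate collection.
        return new String[] {"-t", INPUT_TABLE, "-m", MAIL_ACCOUNT_ID};
    }
}
```

The list only earns its keep when arguments are appended conditionally.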

On Thu, Feb 23, 2012 at 2:03 PM, Ioan Eugen Stan <st...@gmail.com>wrote:

>
>> String[] args = new String[3];
>> args[0] = "max";
>> args[1] = "7";
>> args[2] = "4";
>> int max = Math.main(args);
>>
>>
> A more elegant solution is:
>
> List<String> argList = new LinkedList<String>();
> argList.add("-t");
> argList.add(INPUT_TABLE);
> argList.add("-m");
> argList.add(MAIL_ACCOUNT_ID);
>
> String[] args = argList.toArray(new String[argList.size()]);
>
>
> Cheers,
>
> --
> Ioan Eugen Stan
> http://ieugen.blogspot.com
>

Re: Goals for Mahout 0.7

Posted by Ioan Eugen Stan <st...@gmail.com>.
> String[] args = new String[3];
> args[0] = "max";
> args[1] = "7";
> args[2] = "4";
> int max = Math.main(args);
>

A more elegant solution is:

List<String> argList = new LinkedList<String>();
argList.add("-t");
argList.add(INPUT_TABLE);
argList.add("-m");
argList.add(MAIL_ACCOUNT_ID);

String[] args = argList.toArray(new String[argList.size()]);


Cheers,

-- 
Ioan Eugen Stan
http://ieugen.blogspot.com

Re: Goals for Mahout 0.7

Posted by Geek Gamer <ge...@gmail.com>.
Whenever I used Mahout in a project it ended up as an API endpoint. The
recommender is the easiest to integrate into one, but the classifiers/LDA
are not. This would be a great step; I'd be glad to pitch in with my own
efforts as well.


On Tue, Feb 14, 2012 at 12:21 AM, Ted Dunning <te...@gmail.com> wrote:
> John,
>
> This is well said and is a critical need.
>
> There are some beginnings to this.  The recommender side of the house
> already works the way you say.  The classifier and hashed encoding API's
> are beginning to work that way.  The naive Bayes classifiers pretty much do
> not and the classifier API's are just beginning to have an API-centric form.
>
>
>
> On Mon, Feb 13, 2012 at 5:31 PM, John Conwell <jo...@iamjohn.me> wrote:
>
>> From my perspective, I'd really like to see the Mahout API migrate away
>> from the command-line-centric design it currently utilizes, and migrate more
>> towards a library-centric API design.  I think this would go a long way in
>> getting Mahout adopted into real-life commercial applications.
>>
>> While there might be a few algorithm drivers that you interact with by
>> creating an instance of a class, and calling some method(s) on the instance
>> to interact with it (I haven't actually seen one like that, but there might
>> be a few), many algorithms are invoked by calling some static function on a
>> class that takes ~37 typed arguments.  But what's worse, many drivers are
>> invoked by having to create a String array with ~37 arguments as string
>> values, and calling the static main function on the class.
>>
>> Now I'm not saying that having a static main function available to invoke
>> an algorithm from the command line isn't useful.  It is, when you're testing
>> an algorithm.  But once you want to integrate the algorithm into a
>> commercial workflow it kind of sucks.
>>
>> For example, imagine if the API for invoking Math.max were designed the way
>> many of the Mahout algorithms currently are.  You'd have something like
>> this:
>>
>> String[] args = new String[3];
>> args[0] = "max";
>> args[1] = "7";
>> args[2] = "4";
>> int max = Math.main(args);
>>
>> It makes your code a horrible mess and very hard to maintain, as well as
>> very prone to bugs.
>>
>> When I see a bunch of static main functions as the only way to interact
>> with a library, no matter what the quality of the library is, my initial
>> impression is that this has to be some minimally supported effort by a few
>> PhD candidates still in academia, who will drop the project as soon as they
>> graduate.  And while this might not be the case, it is one of the first
>> impressions it gives, and can lead a company to drop the library from
>> consideration before they do any due diligence into its quality and
>> utility.
>>
>> I think as Mahout matures and gets closer to a 1.0 release, this kind of
>> API re-design will become more and more necessary, especially if you want a
>> higher Mahout integration rate into commercial applications and workflows.
>>
>> Also, I hope I don't sound too negative.  I'm very impressed with Mahout and
>> its capabilities.  I really like that there is a well-thought-out class
>> library of primitives for designing new serial and distributed machine
>> learning algorithms.  And I think it has high utility for integration
>> into highly visible commercial projects.  But its high-level public API
>> really is a barrier to entry when trying to design commercial applications.
>>
>>
>> On Sun, Feb 12, 2012 at 12:20 AM, Jeff Eastman
>> <jd...@windwardsolutions.com>wrote:
>>
>> > We have a couple JIRAs that relate here: We want to factor all the (-cl)
>> > classification steps out of all of the driver classes (MAHOUT-930) and
>> > into a separate job to remove duplicated code; MAHOUT-931 is to add a
>> > pluggable outlier removal capability to this job; and MAHOUT-933 is aimed
>> > at factoring all the iteration mechanics from each driver class into the
>> > ClusterIterator, which uses a ClusterClassifier which is itself an
>> > OnlineLearner. This will hopefully allow semi-supervised classifier
>> > applications to be constructed by feeding cluster-derived models into the
>> > classification process. Still kind of fuzzy at this point but promising
>> > too.
>> >
>> > On 2/11/12 2:29 PM, Frank Scholten wrote:
>> >
>> >> ...
>> >>
>> >> What kind of clustering refactoring do you mean here? I did some work on
>> >> creating bean configurations in the past (MAHOUT-612). I underestimated
>> >> the amount of work required to do the entire refactoring. If this can be
>> >> contributed and committed on a per-job basis I would like to help out.
>> >>
>> >>> ...
>> >>>
>> >>
>> >>
>> >
>>
>>
>> --
>>
>> Thanks,
>> John C
>>

Re: Goals for Mahout 0.7

Posted by Ted Dunning <te...@gmail.com>.
John,

This is well said and is a critical need.

There are some beginnings to this.  The recommender side of the house
already works the way you say.  The classifier and hashed encoding API's
are beginning to work that way.  The naive Bayes classifiers pretty much do
not and the classifier API's are just beginning to have an API-centric form.



On Mon, Feb 13, 2012 at 5:31 PM, John Conwell <jo...@iamjohn.me> wrote:

> From my perspective, I'd really like to see the Mahout API migrate away
> from the command-line-centric design it currently utilizes, and migrate more
> towards a library-centric API design.  I think this would go a long way in
> getting Mahout adopted into real-life commercial applications.
>
> While there might be a few algorithm drivers that you interact with by
> creating an instance of a class, and calling some method(s) on the instance
> to interact with it (I haven't actually seen one like that, but there might
> be a few), many algorithms are invoked by calling some static function on a
> class that takes ~37 typed arguments.  But what's worse, many drivers are
> invoked by having to create a String array with ~37 arguments as string
> values, and calling the static main function on the class.
>
> Now I'm not saying that having a static main function available to invoke
> an algorithm from the command line isn't useful.  It is, when you're testing
> an algorithm.  But once you want to integrate the algorithm into a
> commercial workflow it kind of sucks.
>
> For example, imagine if the API for invoking Math.max were designed the way
> many of the Mahout algorithms currently are.  You'd have something like
> this:
>
> String[] args = new String[3];
> args[0] = "max";
> args[1] = "7";
> args[2] = "4";
> int max = Math.main(args);
>
> It makes your code a horrible mess and very hard to maintain, as well as
> very prone to bugs.
>
> When I see a bunch of static main functions as the only way to interact
> with a library, no matter what the quality of the library is, my initial
> impression is that this has to be some minimally supported effort by a few
> PhD candidates still in academia, who will drop the project as soon as they
> graduate.  And while this might not be the case, it is one of the first
> impressions it gives, and can lead a company to drop the library from
> consideration before they do any due diligence into its quality and
> utility.
>
> I think as Mahout matures and gets closer to a 1.0 release, this kind of
> API re-design will become more and more necessary, especially if you want a
> higher Mahout integration rate into commercial applications and workflows.
>
> Also, I hope I don't sound too negative.  I'm very impressed with Mahout and
> its capabilities.  I really like that there is a well-thought-out class
> library of primitives for designing new serial and distributed machine
> learning algorithms.  And I think it has high utility for integration
> into highly visible commercial projects.  But its high-level public API
> really is a barrier to entry when trying to design commercial applications.
>
>
> On Sun, Feb 12, 2012 at 12:20 AM, Jeff Eastman
> <jd...@windwardsolutions.com>wrote:
>
> > We have a couple JIRAs that relate here: We want to factor all the (-cl)
> > classification steps out of all of the driver classes (MAHOUT-930) and into
> > a separate job to remove duplicated code; MAHOUT-931 is to add a pluggable
> > outlier removal capability to this job; and MAHOUT-933 is aimed at
> > factoring all the iteration mechanics from each driver class into the
> > ClusterIterator, which uses a ClusterClassifier which is itself an
> > OnlineLearner. This will hopefully allow semi-supervised classifier
> > applications to be constructed by feeding cluster-derived models into the
> > classification process. Still kind of fuzzy at this point but promising
> > too.
> >
> > On 2/11/12 2:29 PM, Frank Scholten wrote:
> >
> >> ...
> >>
> >> What kind of clustering refactoring do you mean here? I did some work on
> >> creating bean configurations in the past (MAHOUT-612). I underestimated
> >> the amount of work required to do the entire refactoring. If this can be
> >> contributed and committed on a per-job basis I would like to help out.
> >>
> >>> ...
> >>>
> >>
> >>
> >
>
>
> --
>
> Thanks,
> John C
>

Re: Goals for Mahout 0.7

Posted by Jake Mannix <ja...@gmail.com>.
On Mon, Feb 13, 2012 at 6:54 PM, Lance Norskog <go...@gmail.com> wrote:
>
> Another option is the hadoop xml file property system; this is how
> values get into mappers&reducers anyway.
>

Please don't say "xml" again, or my early-2000s-era PTSD will kick in.


>
> On Mon, Feb 13, 2012 at 11:11 AM, Jake Mannix <ja...@gmail.com>
> wrote:
> > Hi John,
> >
> >  This is some very good feedback, and warrants serious discussion.  In
> > spite
> > of this, I'm going to respond on the fly with some thoughts in this vein.
> >
> >  We use Mahout at Twitter (the LDA stuff recently put in, and
> > mahout-collections
> > in various places, among other things) in production, and we use it,
> > actually,
> > via command-line invocations of the $MAHOUT_HOME/bin/mahout shell
> > script.  It's invoked in an environment where we keep all of the parameters
> > passed in in various (revision controlled) config files and the inputs are
> > produced from a series of Pig jobs which are invoked in similar ways, and
> > the outputs on HDFS are loaded by various and sundry processes in their own
> > ways.
> >
> >  So in general, I totally agree with you that having production *java*
> > apps call into main() methods of other classes is extremely ugly and
> > error-prone.  So how would it look to interact via a nice java API to a
> > system which was going to launch some (possibly iterative series of)
> > MapReduce jobs?
> >
> >  I guess I can see how this would go: DistributedLanczosSolver, for
> > example, can be run without the main() method:
> >
> > public int run(Path inputPath,
> >                 Path outputPath,
> >                 Path outputTmpPath,
> >                 Path workingDirPath,
> >                 int numRows,
> >                 int numCols,
> >                 boolean isSymmetric,
> >                 int desiredRank)
> >
> > is something you could run right after instantiating a
> > DistributedLanczosSolver and
> > .setConf()'ing it.
> >
> > So is that the kind of thing we'd want more of?  Or are you thinking of
> > something
> > nicer, where instead of just a response code, you get handles on java
> > objects which
> > are pointing to the output data sets in some way?  I suppose it's not
> > terribly hard
> > to just do
> >
> >  DistributedRowMatrix outputData =
> >     new DRM(outputPath, myTmpPath, numRows, numCols);
> >
> > after running another job, but maybe it would be even nicer to return a
> > struct-like
> > thing which has all the relevant output data as java objects.
> >
> > Another thing would be making sure that running these classes didn't
> > require such long method argument lists - builders to the rescue!
> >
> >  -jake
> >
> >
> > On Mon, Feb 13, 2012 at 9:31 AM, John Conwell <jo...@iamjohn.me> wrote:
> >
> >> From my perspective, I'd really like to see the Mahout API migrate away
> >> from the command-line-centric design it currently utilizes, and migrate
> >> more towards a library-centric API design.  I think this would go a long
> >> way in getting Mahout adopted into real-life commercial applications.
> >>
> >> While there might be a few algorithm drivers that you interact with by
> >> creating an instance of a class, and calling some method(s) on the
> >> instance to interact with it (I haven't actually seen one like that, but
> >> there might be a few), many algorithms are invoked by calling some static
> >> function on a class that takes ~37 typed arguments.  But what's worse,
> >> many drivers are invoked by having to create a String array with ~37
> >> arguments as string values, and calling the static main function on the
> >> class.
> >>
> >> Now I'm not saying that having a static main function available to invoke
> >> an algorithm from the command line isn't useful.  It is, when you're
> >> testing an algorithm.  But once you want to integrate the algorithm into
> >> a commercial workflow it kind of sucks.
> >>
> >> For example, imagine if the API for invoking Math.max were designed the
> >> way many of the Mahout algorithms currently are.  You'd have something
> >> like this:
> >>
> >> String[] args = new String[3];
> >> args[0] = "max";
> >> args[1] = "7";
> >> args[2] = "4";
> >> int max = Math.main(args);
> >>
> >> It makes your code a horrible mess and very hard to maintain, as well as
> >> very prone to bugs.
> >>
> >> When I see a bunch of static main functions as the only way to interact
> >> with a library, no matter what the quality of the library is, my initial
> >> impression is that this has to be some minimally supported effort by a
> >> few PhD candidates still in academia, who will drop the project as soon
> >> as they graduate.  And while this might not be the case, it is one of the
> >> first impressions it gives, and can lead a company to drop the library
> >> from consideration before they do any due diligence into its quality and
> >> utility.
> >>
> >> I think as Mahout matures and gets closer to a 1.0 release, this kind of
> >> API re-design will become more and more necessary, especially if you want
> >> a higher Mahout integration rate into commercial applications and
> >> workflows.
> >>
> >> Also, I hope I don't sound too negative.  I'm very impressed with Mahout
> >> and its capabilities.  I really like that there is a well-thought-out
> >> class library of primitives for designing new serial and distributed
> >> machine learning algorithms.  And I think it has high utility for
> >> integration into highly visible commercial projects.  But its high-level
> >> public API really is a barrier to entry when trying to design commercial
> >> applications.
> >>
> >>
> >> On Sun, Feb 12, 2012 at 12:20 AM, Jeff Eastman
> >> <jd...@windwardsolutions.com>wrote:
> >>
> >> > We have a couple JIRAs that relate here: We want to factor all the (-cl)
> >> > classification steps out of all of the driver classes (MAHOUT-930) and
> >> > into a separate job to remove duplicated code; MAHOUT-931 is to add a
> >> > pluggable outlier removal capability to this job; and MAHOUT-933 is
> >> > aimed at factoring all the iteration mechanics from each driver class
> >> > into the ClusterIterator, which uses a ClusterClassifier which is itself
> >> > an OnlineLearner. This will hopefully allow semi-supervised classifier
> >> > applications to be constructed by feeding cluster-derived models into
> >> > the classification process. Still kind of fuzzy at this point but
> >> > promising too.
> >> >
> >> > On 2/11/12 2:29 PM, Frank Scholten wrote:
> >> >
> >> >> ...
> >> >>
> >> >> What kind of clustering refactoring do you mean here? I did some work
> >> >> on creating bean configurations in the past (MAHOUT-612). I
> >> >> underestimated the amount of work required to do the entire
> >> >> refactoring. If this can be contributed and committed on a per-job
> >> >> basis I would like to help out.
> >> >>
> >> >>> ...
> >> >>>
> >> >>
> >> >>
> >> >
> >>
> >>
> >> --
> >>
> >> Thanks,
> >> John C
> >>
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>

Re: Goals for Mahout 0.7

Posted by Lance Norskog <go...@gmail.com>.
Some jobs have an intermediate Options pojo just for storing options:
org.apache.mahout.utils.email.MailOptions

Another option is the hadoop xml file property system; this is how
values get into mappers&reducers anyway.
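
Combining the options-POJO idea with the builder suggestion from earlier in
the thread, a driver's options might be collected like this (a hypothetical
sketch; these fields are not the actual
org.apache.mahout.utils.email.MailOptions API):

```java
// Hypothetical immutable options object built via a builder, so callers
// name each parameter instead of passing a long positional argument list.
public class JobOptions {
    private final String inputTable;
    private final String mailAccountId;

    private JobOptions(Builder b) {
        this.inputTable = b.inputTable;
        this.mailAccountId = b.mailAccountId;
    }

    public String getInputTable() { return inputTable; }
    public String getMailAccountId() { return mailAccountId; }

    public static class Builder {
        private String inputTable;                // required
        private String mailAccountId = "default"; // optional, with a default

        public Builder inputTable(String t) { this.inputTable = t; return this; }
        public Builder mailAccountId(String id) { this.mailAccountId = id; return this; }

        public JobOptions build() {
            if (inputTable == null) {
                throw new IllegalStateException("inputTable is required");
            }
            return new JobOptions(this);
        }
    }
}
```

Usage would read as self-documenting calls:
`new JobOptions.Builder().inputTable("mail").mailAccountId("42").build()`,
and required-but-missing options fail fast at build() time.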

On Mon, Feb 13, 2012 at 11:11 AM, Jake Mannix <ja...@gmail.com> wrote:
> Hi John,
>
>  This is some very good feedback, and warrants serious discussion.  In
> spite
> of this, I'm going to respond on the fly with some thoughts in this vein.
>
>  We use Mahout at Twitter (the LDA stuff recently put in, and
> mahout-collections
> in various places, among other things) in production, and we use it,
> actually,
> via command-line invocations of the $MAHOUT_HOME/bin/mahout shell
> script.  It's invoked in an environment where we keep all of the parameters
> passed in in various (revision controlled) config files and the inputs are
> produced
> from a series of Pig jobs which are invoked in similar ways, and the outputs on
> HDFS are loaded by various and sundry processes in their own ways.
>
>  So in general, I totally agree with you that having production *java*
> apps call
> into main() methods of other classes is extremely ugly and error-prone.   So
> how would it look to interact via a nice java API to a system which was
> going
> to launch some (possibly iterative series of) MapReduce jobs?
>
>  I guess I can see how this would go: DistributedLanczosSolver, for example
> can be run without the main() method:
>
> public int run(Path inputPath,
>                 Path outputPath,
>                 Path outputTmpPath,
>                 Path workingDirPath,
>                 int numRows,
>                 int numCols,
>                 boolean isSymmetric,
>                 int desiredRank)
>
> is something you could run right after instantiating a
> DistributedLanczosSolver and
> .setConf()'ing it.
>
> So is that the kind of thing we'd want more of?  Or are you thinking of
> something
> nicer, where instead of just a response code, you get handles on java
> objects which
> are pointing to the output data sets in some way?  I suppose it's not
> terribly hard
> to just do
>
>  DistributedRowMatrix outputData =
>     new DRM(outputPath, myTmpPath, numRows, numCols);
>
> after running another job, but maybe it would be even nicer to return a
> struct-like
> thing which has all the relevant output data as java objects.
>
> Another thing would be making sure that running these classes didn't require
> such long method argument lists - builders to the rescue!
>
>  -jake
>
>
> On Mon, Feb 13, 2012 at 9:31 AM, John Conwell <jo...@iamjohn.me> wrote:
>
>> From my perspective, I'd really like to see the Mahout API migrate away
>> from the command-line-centric design it currently utilizes, and migrate more
>> towards a library-centric API design.  I think this would go a long way in
>> getting Mahout adopted into real-life commercial applications.
>>
>> While there might be a few algorithm drivers that you interact with by
>> creating an instance of a class, and calling some method(s) on the instance
>> to interact with it (I haven't actually seen one like that, but there might
>> be a few), many algorithms are invoked by calling some static function on a
>> class that takes ~37 typed arguments.  But what's worse, many drivers are
>> invoked by having to create a String array with ~37 arguments as string
>> values, and calling the static main function on the class.
>>
>> Now I'm not saying that having a static main function available to invoke
>> an algorithm from the command line isn't useful.  It is, when you're testing
>> an algorithm.  But once you want to integrate the algorithm into a
>> commercial workflow it kind of sucks.
>>
>> For example, imagine if the API for invoking Math.max were designed the way
>> many of the Mahout algorithms currently are.  You'd have something like
>> this:
>>
>> String[] args = new String[3];
>> args[0] = "max";
>> args[1] = "7";
>> args[2] = "4";
>> int max = Math.main(args);
>>
>> It makes your code a horrible mess and very hard to maintain, as well as
>> very prone to bugs.
>>
>> When I see a bunch of static main functions as the only way to interact
>> with a library, no matter what the quality of the library is, my initial
>> impression is that this has to be some minimally supported effort by a few
>> PhD candidates still in academia, who will drop the project as soon as they
>> graduate.  And while this might not be the case, it is one of the first
>> impressions it gives, and can lead a company to drop the library from
>> consideration before they do any due diligence into its quality and
>> utility.
>>
>> I think as Mahout matures and gets closer to a 1.0 release, this kind of
>> API re-design will become more and more necessary, especially if you want a
>> higher Mahout integration rate into commercial applications and workflows.
>>
>> Also, I hope I don't sound too negative.  I'm very impressed with Mahout and
>> its capabilities.  I really like that there is a well-thought-out class
>> library of primitives for designing new serial and distributed machine
>> learning algorithms.  And I think it has high utility for integration
>> into highly visible commercial projects.  But its high-level public API
>> really is a barrier to entry when trying to design commercial applications.
>>
>>
>> On Sun, Feb 12, 2012 at 12:20 AM, Jeff Eastman
>> <jd...@windwardsolutions.com>wrote:
>>
>> > We have a couple JIRAs that relate here: We want to factor all the (-cl)
>> > classification steps out of all of the driver classes (MAHOUT-930) and
>> > into a separate job to remove duplicated code; MAHOUT-931 is to add a
>> > pluggable outlier removal capability to this job; and MAHOUT-933 is aimed
>> > at factoring all the iteration mechanics from each driver class into the
>> > ClusterIterator, which uses a ClusterClassifier which is itself an
>> > OnlineLearner. This will hopefully allow semi-supervised classifier
>> > applications to be constructed by feeding cluster-derived models into the
>> > classification process. Still kind of fuzzy at this point but promising
>> > too.
>> >
>> > On 2/11/12 2:29 PM, Frank Scholten wrote:
>> >
>> >> ...
>> >>
>> >> What kind of clustering refactoring do you mean here? I did some work on
>> >> creating bean configurations in the past (MAHOUT-612). I underestimated
>> the
>> >> amount of work required to do the entire refactoring. If this can be
>> >> contributed and committed on a per-job basis I would like to help out.
>> >>
>> >>> ...
>> >>>
>> >>
>> >>
>> >
>>
>>
>> --
>>
>> Thanks,
>> John C
>>



-- 
Lance Norskog
goksron@gmail.com

Re: Goals for Mahout 0.7

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Wed, Feb 22, 2012 at 10:44 AM, Jake Mannix <ja...@gmail.com> wrote:
> On Wed, Feb 22, 2012 at 10:00 AM, John Conwell <jo...@iamjohn.me> wrote:
>

>>
>> But the second one is VERY important and can be a show stopper.  Any large
>> workflow that uses Hadoop somewhere in its API stack needs two things.
>>  First any call to Hadoop needs to expose to the caller some kind of handle
>> / identifier to the hadoop job that was launched.  This is because the
>> caller should be able to monitor the hadoop job, provide status and
>> feedback to the users, troubleshoot, etc, any kind of long running process.
>>  And if the Mahout API call invokes multiple Hadoop jobs in a row, as often
>> is the case in Mahout, the caller needs to be able to gain access to each
>> of hadoop job ids as they become available.  The second thing is any
>> blocking long running API call needs to expose the option to run the call
>> asynchronously (and provide hadoop job ids as the hadoop jobs get
>> invoked).
>>
>> Take, for example, the LDA algorithm.  It's not unreasonable to say that
>> calling LDADriver.run() could start a chain of N mapreduce jobs that could
>> take 8 hours to complete, given a large enough corpus of documents and
>> large enough number of iterations.  In trying to integrate this into a
>> workflow application I have to design my app knowing that every time it
>> calls LDADriver.run() it could potentially block the process from several
>> hours to several days, with no way to inspect the progress of what is
>> happening.  The core problems are: my app has no idea how long it's going to
>> block, how far along the blocked process is, if any of the mapreduce jobs
>> failed, and if they did fail, which mapreduce jobs are associated with
>> which call to LDADriver.run().
>>
>> But if all algorithm API calls allowed me to invoke them asynchronously,
>> and provided me with an object that I could use to track what is going on
>> in Hadoop, such as a realtime updated list of job ids for example (an
>> eventing mechanism when new job ids are added would be nice, but not a
>> must), it would go a long way in easing the barrier to entry of integrating
>> Mahout into commercial applications.
>>
>
> +1  I like this idea: synchronously return a handle to a MahoutStatus
> object,
> which you can poll for current status, current paths to output stuff, even
> handles to intermediate state (and eventually final state), that would
> be awesome.  I like this, it's totally pro-style, unlike what we have now.

That uniformity would probably also be very useful for R integration.

-d

Re: Goals for Mahout 0.7

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
On 2/22/12 11:44 AM, Jake Mannix wrote:
> On Wed, Feb 22, 2012 at 10:00 AM, John Conwell<jo...@iamjohn.me>  wrote:
>
>> I've been meaning to respond with my thoughts to this (though it took me
>> almost two weeks to get around to it).
>>
>> Jake, your example of the DistributedLanczosSolver in how to interact with
>> the different algorithms is along the lines of what I was thinking, at
>> least as a bare minimum.  I'm a huge fan of using Builder classes for these
>> types of scenarios, but I do understand that they are a pain to write, so
>> in the short term to get all the algorithms API friendly by just having run
>> functions with typed arguments is fine.  Anything to get rid of my String[]
>> args variables I'm creating and passing around.
>>
>> You also mention the output to the algorithm APIs.  I'm not a big fan of
>> the returned 1 or 0 response codes.  Seeing that sends me into COM hResult
>> PTSD-induced panic attacks (NOTE: I'm not making light of PTSD).  Except
>> it's worse than hResults, because at least there were multiple hResult
>> values that theoretically I could look up to figure out the actual problem
>> that occurred.
>>
>> If I had my way, I would want the API output to return me two things:
>> handles/objects that point to all the generated output of the algorithm
>> (like you mentioned), and an object that gives me all the information I
>> need to track the Hadoop mapreduce jobs that were invoked by the API call.
>>
>> The first one is a nice to have.  Since I most likely pass in a Path object
>> to where I want the output to go, I know where the output is, and I should
>> be able to infer what type of data it is, and so forth.  Having output
>> handles to this data would be really nice, and make integrating Mahout into
>> larger workflows much easier, but it's not a show stopper.
>>
>> But the second one is VERY important and can be a show stopper.  Any large
>> workflow that uses Hadoop somewhere in its API stack needs two things.
>>   First any call to Hadoop needs to expose to the caller some kind of handle
>> / identifier to the hadoop job that was launched.  This is because the
>> caller should be able to monitor the hadoop job, provide status and
>> feedback to the users, troubleshoot, etc, any kind of long running process.
>>   And if the Mahout API call invokes multiple Hadoop jobs in a row, as often
>> is the case in Mahout, the caller needs to be able to gain access to each
>> of hadoop job ids as they become available.  The second thing is any
>> blocking long running API call needs to expose the option to run the call
>> asynchronously (and provide hadoop job ids as the hadoop jobs get
>> invoked).
>>
>> Take, for example, the LDA algorithm.  It's not unreasonable to say that
>> calling LDADriver.run() could start a chain of N mapreduce jobs that could
>> take 8 hours to complete, given a large enough corpus of documents and
>> large enough number of iterations.  In trying to integrate this into a
>> workflow application I have to design my app knowing that every time it
>> calls LDADriver.run() it could potentially block the process from several
>> hours to several days, with no way to inspect the progress of what is
>> happening.  The core problems are: my app has no idea how long it's going to
>> block, how far along the blocked process is, if any of the mapreduce jobs
>> failed, and if they did fail, which mapreduce jobs are associated with
>> which call to LDADriver.run().
>>
>> But if all algorithm API calls allowed me to invoke them asynchronously,
>> and provided me with an object that I could use to track what is going on
>> in Hadoop, such as a realtime updated list of job ids for example (an
>> eventing mechanism when new job ids are added would be nice, but not a
>> must), it would go a long way in easing the barrier to entry of integrating
>> Mahout into commercial applications.
>>
+1 here too. If one of you can come up with a design, I'm in for using it 
with clustering while we refactor.
> +1  I like this idea: synchronously return a handle to a MahoutStatus
> object,
> which you can poll for current status, current paths to output stuff, even
> handles to intermediate state (and eventually final state), that would
> be awesome.  I like this, it's totally pro-style, unlike what we have now.
>
>
>> One last thing: I'd like to see Mahout getting away from using static
>> functions so much.  I don't really have a non-religious reason for this,
>> other than to say that I find when people use APIs that are very static
>> function heavy they tend to write their own code in the same way, and you
>> end up with 1000 line monolithic functions being invoked from main()
>> functions, which is never a good thing.
>>
+2 from me on this one. Static functions really limit testability too, 
since they thwart dependency injection.
> Agreed, big-time.  Static functions actually *are* the devil, for the most
> part.  I actually do subscribe to that religion, but I haven't been to
> church in a long time.  Mea culpa?
>
>
>> Is that too much to ask?  :)
>>
> Not at all.
>
>    -jake
>
>
>> On Mon, Feb 13, 2012 at 11:11 AM, Jake Mannix<ja...@gmail.com>
>> wrote:
>>
>>> Hi John,
>>>
>>>   This is some very good feedback, and warrants serious discussion.  In
>>> spite
>>> of this, I'm going to respond on the fly with some thoughts in this vein.
>>>
>>>   We use Mahout at Twitter (the LDA stuff recently put in, and
>>> mahout-collections
>>> in various places, among other things) in production, and we use it,
>>> actually,
>>> via command-line invocations of the $MAHOUT_HOME/bin/mahout shell
>>> script.  It's invoked in an environment where we keep all of the
>> parameters
>>> passed in in various (revision controlled) config files and the inputs
>> are
>>> produced
>>> from a series of Pig jobs which are invoked in similar ways, and the outputs
>>> on
>>> HDFS are loaded by various and sundry processes in their own ways.
>>>
>>>   So in general, I totally agree with you that having production *java*
>>> apps call
>>> into main() methods of other classes is extremely ugly and error-prone.
>>> So
>>> how would it look to interact via a nice java API to a system which was
>>> going
>>> to launch some (possibly iterative series of) MapReduce jobs?
>>>
>>>   I guess I can see how this would go: DistributedLanczosSolver, for
>> example
>>> can be run without the main() method:
>>>
>>> public int run(Path inputPath,
>>>                  Path outputPath,
>>>                  Path outputTmpPath,
>>>                  Path workingDirPath,
>>>                  int numRows,
>>>                  int numCols,
>>>                  boolean isSymmetric,
>>>                  int desiredRank)
>>>
>>> is something you could run right after instantiating a
>>> DistributedLanczosSolver and
>>> .setConf()'ing it.
>>>
>>> So is that the kind of thing we'd want more of?  Or are you thinking of
>>> something
>>> nicer, where instead of just a response code, you get handles on java
>>> objects which
>>> are pointing to the output data sets in some way?  I suppose it's not
>>> terribly hard
>>> to just do
>>>
>>>   DistributedRowMatrix outputData =
>>>      new DRM(outputPath, myTmpPath, numRows, numCols);
>>>
>>> after running another job, but maybe it would be even nicer to return a
>>> struct-like
>>> thing which has all the relevant output data as java objects.
>>>
>>> Another thing would be making sure that running these classes didn't
>>> require
>>> such long method argument lists - builders to the rescue!
>>>
>>>   -jake
>>>
>>>
>>> On Mon, Feb 13, 2012 at 9:31 AM, John Conwell<jo...@iamjohn.me>  wrote:
>>>
>>>>  From my perspective, I'd really like to see the Mahout API migrate away
>>>> from a command line centric design it currently utilizes, and migrate
>>> more
>>>> towards a library-centric API design.  I think this would go a long
>> way
>>> in
>>>> getting Mahout adopted into real life commercial applications.
>>>>
>>>> While there might be a few algorithm drivers that you interact with by
>>>> creating an instance of a class, and calling some method(s) on the
>>> instance
>>>> to interact with it (I haven't actually seen one like that, but there
>>> might
>>>> be a few), many algorithms are invoked by calling some static function
>>> on a
>>>> class that takes ~37 typed arguments.  But what's worse, many drivers
>> are
>>>> invoked by having to create a String array with ~37 arguments as string
>>>> values, and calling the static main function on the class.
>>>>
>>>> Now I'm not saying that having a static main function available to
>> invoke
>>>> an algorithm from the command line isn't useful.  It is, when you're
>>> testing
>>>> an algorithm.  But once you want to integrate the algorithm into a
>>>> commercial workflow it kind of sucks.
>>>>
>>>> For example, imagine if the API for invoking Math.max was designed the
>>> way
>>>> many of the Mahout algorithms currently are?  You'd have something like
>>>> this:
>>>>
>>>> String[] args = new String[3];
>>>> args[0] = "max";
>>>> args[1] = "7";
>>>> args[2] = "4";
>>>> int max = Math.main(args);
>>>>
>>>> It makes your code a horrible mess and very hard to maintain, as well
>> as
>>>> very prone to bugs.
>>>>
>>>> When I see a bunch of static main functions as the only way to interact
>>>> with a library, no matter what the quality of the library is, my
>> initial
>>>> impression is that this has to be some minimally supported effort by a
>>> few
>>>> PhD candidates still in academia, who will drop the project as soon as
>>> they
>>>> graduate.  And while this might not be the case, it is one of the first
>>>> impressions it gives, and can lead a company to drop the library from
>>>> consideration before they do any due diligence into its quality and
>>>> utility.
>>>>
>>>> I think as Mahout matures and gets closer to a 1.0 release, this kind
>> of
>>>> API re-design will become more and more necessary, especially if you
>>> want a
>>>> higher Mahout integration rate into commercial applications and
>>> workflows.
>>>> Also, I hope I don't sound too negative.  I'm very impressed with Mahout
>>> and
>>>> its capabilities.  I really like that there is a well thought out class
>>>> library of primitives for designing new serial and distributed machine
>>>> learning algorithms.  And I think it has a high utility for integration
>>>> into highly visible commercial projects.  But its high level public API
>>>> really is a barrier to entry when trying to design commercial
>>> applications.
>>>>
>>>> On Sun, Feb 12, 2012 at 12:20 AM, Jeff Eastman
>>>> <jd...@windwardsolutions.com>wrote:
>>>>
>>>>> We have a couple JIRAs that relate here: We want to factor all the
>>> (-cl)
>>>>> classification steps out of all of the driver classes (MAHOUT-930)
>> and
>>>> into
>>>>> a separate job to remove duplicated code; MAHOUT-931 is to add a
>>>> pluggable
>>>>> outlier removal capability to this job; and MAHOUT-933 is aimed at
>>>>> factoring all the iteration mechanics from each driver class into the
>>>>> ClusterIterator, which uses a ClusterClassifier which is itself an
>>>>> OnlineLearner. This will hopefully allow semi-supervised classifier
>>>>> applications to be constructed by feeding cluster-derived models into
>>> the
>>>>> classification process. Still kind of fuzzy at this point but
>> promising
>>>> too.
>>>>> On 2/11/12 2:29 PM, Frank Scholten wrote:
>>>>>
>>>>>> ...
>>>>>>
>>>>>> What kind of clustering refactoring do you mean here? I did some work on
>>>>>> creating bean configurations in the past (MAHOUT-612). I
>>> underestimated
>>>> the
>>>>>> amount of work required to do the entire refactoring. If this can be
>>>>>> contributed and committed on a per-job basis I would like to help
>> out.
>>>>>>> ...
>>>>>>>
>>>>>>
>>>>
>>>> --
>>>>
>>>> Thanks,
>>>> John C
>>>>
>>
>>
>> --
>>
>> Thanks,
>> John C
>>


Re: Goals for Mahout 0.7

Posted by Jake Mannix <ja...@gmail.com>.
On Wed, Feb 22, 2012 at 10:00 AM, John Conwell <jo...@iamjohn.me> wrote:

> I've been meaning to respond with my thoughts to this (though it took me
> almost two weeks to get around to it).
>
> Jake, your example of the DistributedLanczosSolver in how to interact with
> the different algorithms is along the lines of what I was thinking, at
> least as a bare minimum.  I'm a huge fan of using Builder classes for these
> types of scenarios, but I do understand that they are a pain to write, so
> in the short term to get all the algorithms API friendly by just having run
> functions with typed arguments is fine.  Anything to get rid of my String[]
> args variables I'm creating and passing around.
>
> You also mention the output to the algorithm APIs.  I'm not a big fan of
> the returned 1 or 0 response codes.  Seeing that sends me into COM hResult
> PTSD-induced panic attacks (NOTE: I'm not making light of PTSD).  Except
> it's worse than hResults, because at least there were multiple hResult
> values that theoretically I could look up to figure out the actual problem
> that occurred.
>
> If I had my way, I would want the API output to return me two things:
> handles/objects that point to all the generated output of the algorithm
> (like you mentioned), and an object that gives me all the information I
> need to track the Hadoop mapreduce jobs that were invoked by the API call.
>
> The first one is a nice to have.  Since I most likely pass in a Path object
> to where I want the output to go, I know where the output is, and I should
> be able to infer what type of data it is, and so forth.  Having output
> handles to this data would be really nice, and make integrating Mahout into
> larger workflows much easier, but it's not a show stopper.
>
> But the second one is VERY important and can be a show stopper.  Any large
> workflow that uses Hadoop somewhere in its API stack needs two things.
>  First any call to Hadoop needs to expose to the caller some kind of handle
> / identifier to the hadoop job that was launched.  This is because the
> caller should be able to monitor the hadoop job, provide status and
> feedback to the users, troubleshoot, etc, any kind of long running process.
>  And if the Mahout API call invokes multiple Hadoop jobs in a row, as often
> is the case in Mahout, the caller needs to be able to gain access to each
> of hadoop job ids as they become available.  The second thing is any
> blocking long running API call needs to expose the option to run the call
> asynchronously (and provide hadoop job ids as the hadoop jobs get
> invoked).
>
> Take, for example, the LDA algorithm.  It's not unreasonable to say that
> calling LDADriver.run() could start a chain of N mapreduce jobs that could
> take 8 hours to complete, given a large enough corpus of documents and
> large enough number of iterations.  In trying to integrate this into a
> workflow application I have to design my app knowing that every time it
> calls LDADriver.run() it could potentially block the process from several
> hours to several days, with no way to inspect the progress of what is
> happening.  The core problems are: my app has no idea how long it's going to
> block, how far along the blocked process is, if any of the mapreduce jobs
> failed, and if they did fail, which mapreduce jobs are associated with
> which call to LDADriver.run().
>
> But if all algorithm API calls allowed me to invoke them asynchronously,
> and provided me with an object that I could use to track what is going on
> in Hadoop, such as a realtime updated list of job ids for example (an
> eventing mechanism when new job ids are added would be nice, but not a
> must), it would go a long way in easing the barrier to entry of integrating
> Mahout into commercial applications.
>

+1  I like this idea: synchronously return a handle to a MahoutStatus
object,
which you can poll for current status, current paths to output stuff, even
handles to intermediate state (and eventually final state), that would
be awesome.  I like this, it's totally pro-style, unlike what we have now.
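To make the shape of that concrete, here's a minimal sketch of what such a status handle and a caller's polling loop could look like.  Every name below (MahoutStatus, launchedJobIds, etc.) is invented for illustration; nothing like this exists in Mahout today.

```java
import java.util.List;

// Hypothetical status handle for a running algorithm; all names invented.
interface MahoutStatus {
    boolean isDone();
    List<String> launchedJobIds();  // Hadoop job ids, in launch order
    String outputPath();            // where (partial or final) output lives
}

final class StatusPollingDemo {
    // Tiny fake "run" so the polling loop below is actually runnable:
    // it reports done on the third poll.
    static MahoutStatus fakeRun() {
        return new MahoutStatus() {
            private int polls = 0;
            @Override public boolean isDone() { return ++polls >= 3; }
            @Override public List<String> launchedJobIds() {
                return List.of("job_201202_0001", "job_201202_0002");
            }
            @Override public String outputPath() { return "/tmp/lda-out"; }
        };
    }

    // What a caller's monitoring loop might look like; real code would
    // sleep between polls and surface the job ids to a UI or a log.
    static int pollUntilDone(MahoutStatus status) {
        int polls = 0;
        while (!status.isDone()) {
            polls++;
        }
        return polls;
    }
}
```

The point is only the API shape: the caller can watch job ids and progress instead of blocking inside a monolithic run().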


> One last thing: I'd like to see Mahout getting away from using static
> functions so much.  I don't really have a non-religious reason for this,
> other than to say that I find when people use APIs that are very static
> function heavy they tend to write their own code in the same way, and you
> end up with 1000 line monolithic functions being invoked from main()
> functions, which is never a good thing.
>

Agreed, big-time.  Static functions actually *are* the devil, for the most
part.  I actually do subscribe to that religion, but I haven't been to
church in a long time.  Mea culpa?


> Is that too much to ask?  :)
>

Not at all.

  -jake


>
> On Mon, Feb 13, 2012 at 11:11 AM, Jake Mannix <ja...@gmail.com>
> wrote:
>
> > Hi John,
> >
> >  This is some very good feedback, and warrants serious discussion.  In
> > spite
> > of this, I'm going to respond on the fly with some thoughts in this vein.
> >
> >  We use Mahout at Twitter (the LDA stuff recently put in, and
> > mahout-collections
> > in various places, among other things) in production, and we use it,
> > actually,
> > via command-line invocations of the $MAHOUT_HOME/bin/mahout shell
> > script.  It's invoked in an environment where we keep all of the
> parameters
> > passed in in various (revision controlled) config files and the inputs
> are
> > produced
> > from a series of Pig jobs which are invoked in similar ways, and the outputs
> > on
> > HDFS are loaded by various and sundry processes in their own ways.
> >
> >  So in general, I totally agree with you that having production *java*
> > apps call
> > into main() methods of other classes is extremely ugly and error-prone.
> > So
> > how would it look to interact via a nice java API to a system which was
> > going
> > to launch some (possibly iterative series of) MapReduce jobs?
> >
> >  I guess I can see how this would go: DistributedLanczosSolver, for
> example
> > can be run without the main() method:
> >
> > public int run(Path inputPath,
> >                 Path outputPath,
> >                 Path outputTmpPath,
> >                 Path workingDirPath,
> >                 int numRows,
> >                 int numCols,
> >                 boolean isSymmetric,
> >                 int desiredRank)
> >
> > is something you could run right after instantiating a
> > DistributedLanczosSolver and
> > .setConf()'ing it.
> >
> > So is that the kind of thing we'd want more of?  Or are you thinking of
> > something
> > nicer, where instead of just a response code, you get handles on java
> > objects which
> > are pointing to the output data sets in some way?  I suppose it's not
> > terribly hard
> > to just do
> >
> >  DistributedRowMatrix outputData =
> >     new DRM(outputPath, myTmpPath, numRows, numCols);
> >
> > after running another job, but maybe it would be even nicer to return a
> > struct-like
> > thing which has all the relevant output data as java objects.
> >
> > Another thing would be making sure that running these classes didn't
> > require
> > such long method argument lists - builders to the rescue!
> >
> >  -jake
> >
> >
> > On Mon, Feb 13, 2012 at 9:31 AM, John Conwell <jo...@iamjohn.me> wrote:
> >
> > > From my perspective, I'd really like to see the Mahout API migrate away
> > > from a command line centric design it currently utilizes, and migrate
> > more
> > > towards a library-centric API design.  I think this would go a long
> way
> > in
> > > getting Mahout adopted into real life commercial applications.
> > >
> > > While there might be a few algorithm drivers that you interact with by
> > > creating an instance of a class, and calling some method(s) on the
> > instance
> > > to interact with it (I haven't actually seen one like that, but there
> > might
> > > be a few), many algorithms are invoked by calling some static function
> > on a
> > > class that takes ~37 typed arguments.  But what's worse, many drivers
> are
> > > invoked by having to create a String array with ~37 arguments as string
> > > values, and calling the static main function on the class.
> > >
> > > Now I'm not saying that having a static main function available to
> invoke
> > > an algorithm from the command line isn't useful.  It is, when you're
> > testing
> > > an algorithm.  But once you want to integrate the algorithm into a
> > > commercial workflow it kind of sucks.
> > >
> > > For example, imagine if the API for invoking Math.max was designed the
> > way
> > > many of the Mahout algorithms currently are?  You'd have something like
> > > this:
> > >
> > > String[] args = new String[3];
> > > args[0] = "max";
> > > args[1] = "7";
> > > args[2] = "4";
> > > int max = Math.main(args);
> > >
> > > It makes your code a horrible mess and very hard to maintain, as well
> as
> > > very prone to bugs.
> > >
> > > When I see a bunch of static main functions as the only way to interact
> > > with a library, no matter what the quality of the library is, my
> initial
> > > impression is that this has to be some minimally supported effort by a
> > few
> > > PhD candidates still in academia, who will drop the project as soon as
> > they
> > > graduate.  And while this might not be the case, it is one of the first
> > > impressions it gives, and can lead a company to drop the library from
> > > consideration before they do any due diligence into its quality and
> > > utility.
> > >
> > > I think as Mahout matures and gets closer to a 1.0 release, this kind
> of
> > > API re-design will become more and more necessary, especially if you
> > want a
> > > higher Mahout integration rate into commercial applications and
> > workflows.
> > >
> > > Also, I hope I don't sound too negative.  I'm very impressed with Mahout
> > and
> > > its capabilities.  I really like that there is a well thought out class
> > > library of primitives for designing new serial and distributed machine
> > > learning algorithms.  And I think it has a high utility for integration
> > > into highly visible commercial projects.  But its high level public API
> > > really is a barrier to entry when trying to design commercial
> > applications.
> > >
> > >
> > > On Sun, Feb 12, 2012 at 12:20 AM, Jeff Eastman
> > > <jd...@windwardsolutions.com>wrote:
> > >
> > > > We have a couple JIRAs that relate here: We want to factor all the
> > (-cl)
> > > > classification steps out of all of the driver classes (MAHOUT-930)
> and
> > > into
> > > > a separate job to remove duplicated code; MAHOUT-931 is to add a
> > > pluggable
> > > > outlier removal capability to this job; and MAHOUT-933 is aimed at
> > > > factoring all the iteration mechanics from each driver class into the
> > > > ClusterIterator, which uses a ClusterClassifier which is itself an
> > > > OnlineLearner. This will hopefully allow semi-supervised classifier
> > > > applications to be constructed by feeding cluster-derived models into
> > the
> > > > classification process. Still kind of fuzzy at this point but
> promising
> > > too.
> > > >
> > > > On 2/11/12 2:29 PM, Frank Scholten wrote:
> > > >
> > > >> ...
> > > >>
> > > >> What kind of clustering refactoring do you mean here? I did some work on
> > > >> creating bean configurations in the past (MAHOUT-612). I
> > underestimated
> > > the
> > > >> amount of work required to do the entire refactoring. If this can be
> > > >> contributed and committed on a per-job basis I would like to help
> out.
> > > >>
> > > >>> ...
> > > >>>
> > > >>
> > > >>
> > > >
> > >
> > >
> > > --
> > >
> > > Thanks,
> > > John C
> > >
> >
>
>
>
> --
>
> Thanks,
> John C
>

Re: Goals for Mahout 0.7

Posted by John Conwell <jo...@iamjohn.me>.
I've been meaning to respond with my thoughts to this (though it took me
almost two weeks to get around to it).

Jake, your example of the DistributedLanczosSolver in how to interact with
the different algorithms is along the lines of what I was thinking, at
least as a bare minimum.  I'm a huge fan of using Builder classes for these
types of scenarios, but I do understand that they are a pain to write, so
in the short term just making all the algorithms API-friendly, with run
functions that take typed arguments, is fine.  Anything to get rid of my String[]
args variables I'm creating and passing around.
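For what it's worth, a builder for something like the Lanczos run parameters could look roughly like this.  This is a sketch only: LanczosJob and all its methods are invented here, not Mahout classes, and the paths would be org.apache.hadoop.fs.Path in real code.

```java
// Hypothetical builder for a Lanczos-style run; none of these names exist
// in Mahout -- this only illustrates the builder idea with plain Strings.
final class LanczosJob {
    final String inputPath;
    final String outputPath;
    final int numRows;
    final int numCols;
    final boolean symmetric;
    final int desiredRank;

    private LanczosJob(Builder b) {
        this.inputPath = b.inputPath;
        this.outputPath = b.outputPath;
        this.numRows = b.numRows;
        this.numCols = b.numCols;
        this.symmetric = b.symmetric;
        this.desiredRank = b.desiredRank;
    }

    static final class Builder {
        private String inputPath;
        private String outputPath;
        private int numRows;
        private int numCols;
        private boolean symmetric = false;  // sensible default
        private int desiredRank = 10;       // sensible default

        Builder input(String path)   { this.inputPath = path; return this; }
        Builder output(String path)  { this.outputPath = path; return this; }
        Builder dimensions(int rows, int cols) {
            this.numRows = rows; this.numCols = cols; return this;
        }
        Builder symmetric(boolean s) { this.symmetric = s; return this; }
        Builder desiredRank(int r)   { this.desiredRank = r; return this; }

        LanczosJob build() {
            // Fail fast on required parameters instead of deep in a mapper.
            if (inputPath == null || outputPath == null) {
                throw new IllegalStateException("input and output are required");
            }
            return new LanczosJob(this);
        }
    }
}
```

A call like new LanczosJob.Builder().input("hdfs:///in").output("hdfs:///out").dimensions(100000, 500).desiredRank(50).build() reads a lot better than a positional String[] of arguments, and optional parameters get defaults for free.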

You also mention the output to the algorithm APIs.  I'm not a big fan of
the returned 1 or 0 response codes.  Seeing that sends me into COM hResult
PTSD-induced panic attacks (NOTE: I'm not making light of PTSD).  Except
it's worse than hResults, because at least there were multiple hResult
values that theoretically I could look up to figure out the actual problem
that occurred.

If I had my way, I would want the API output to return me two things:
handles/objects that point to all the generated output of the algorithm
(like you mentioned), and an object that gives me all the information I
need to track the Hadoop mapreduce jobs that were invoked by the API call.

The first one is a nice to have.  Since I most likely pass in a Path object
to where I want the output to go, I know where the output is, and I should
be able to infer what type of data it is, and so forth.  Having output
handles to this data would be really nice, and make integrating Mahout into
larger workflows much easier, but it's not a show stopper.

But the second one is VERY important and can be a show stopper.  Any large
workflow that uses Hadoop somewhere in its API stack needs two things.
 First any call to Hadoop needs to expose to the caller some kind of handle
/ identifier to the hadoop job that was launched.  This is because the
caller should be able to monitor the hadoop job, provide status and
feedback to the users, troubleshoot, etc., for any kind of long-running process.
 And if the Mahout API call invokes multiple Hadoop jobs in a row, as often
is the case in Mahout, the caller needs to be able to gain access to each
of hadoop job ids as they become available.  The second thing is any
blocking long running API call needs to expose the option to run the call
asynchronously (and provide hadoop job ids as the hadoop jobs get
invoked).

Take, for example, the LDA algorithm.  It's not unreasonable to say that
calling LDADriver.run() could start a chain of N mapreduce jobs that could
take 8 hours to complete, given a large enough corpus of documents and
large enough number of iterations.  In trying to integrate this into a
workflow application I have to design my app knowing that every time it
calls LDADriver.run() it could potentially block the process from several
hours to several days, with no way to inspect the progress of what is
happening.  The core problems are: my app has no idea how long it's going to
block, how far along the blocked process is, if any of the mapreduce jobs
failed, and if they did fail, which mapreduce jobs are associated with
which call to LDADriver.run().

But if all algorithm API calls allowed me to invoke them asynchronously,
and provided me with an object that I could use to track what is going on
in Hadoop, such as a realtime updated list of job ids for example (an
eventing mechanism when new job ids are added would be nice, but not a
must), it would go a long way in easing the barrier to entry of integrating
Mahout into commercial applications.
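Sketched with plain java.util.concurrent types, an asynchronous driver call along these lines could look as follows.  Again, every name here is invented (RunHandle, AsyncDriverDemo); this is only the shape of the thing, not an existing Mahout API, and the loop body stands in for real MapReduce job submissions.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;

// Hypothetical handle for an asynchronously launched driver run.  The
// driver appends each Hadoop job id as that job is submitted, so the
// caller can watch progress instead of blocking for hours on run().
final class RunHandle {
    private final List<String> jobIds = new CopyOnWriteArrayList<>();
    private final CompletableFuture<Integer> exitCode;

    RunHandle(CompletableFuture<Integer> exitCode) { this.exitCode = exitCode; }

    void addJobId(String id) { jobIds.add(id); }             // driver-side
    List<String> jobIds()    { return List.copyOf(jobIds); } // caller-side
    boolean isDone()         { return exitCode.isDone(); }
    int awaitExitCode() throws Exception { return exitCode.get(); }
}

final class AsyncDriverDemo {
    // Stand-in for an iterative driver that launches one "job" per
    // iteration; a real driver would submit a MapReduce job here and
    // record its real id.
    static RunHandle runAsync(ExecutorService pool, int iterations) {
        CompletableFuture<Integer> exitCode = new CompletableFuture<>();
        RunHandle handle = new RunHandle(exitCode);
        pool.execute(() -> {
            for (int i = 0; i < iterations; i++) {
                handle.addJobId("job_local_" + i);
            }
            exitCode.complete(0);
        });
        return handle;
    }
}
```

The caller decides where to block, if at all: it can poll isDone() and jobIds() from a monitoring thread, or call awaitExitCode() when it genuinely needs the result.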

One last thing: I'd like to see Mahout getting away from using static
functions so much.  I don't really have a non-religious reason for this,
other than to say that I find when people use APIs that are very static
function heavy they tend to write their own code in the same way, and you
end up with 1000 line monolithic functions being invoked from main()
functions, which is never a good thing.

Is that too much to ask?  :)

On Mon, Feb 13, 2012 at 11:11 AM, Jake Mannix <ja...@gmail.com> wrote:

> Hi John,
>
>  This is some very good feedback, and warrants serious discussion.  In
> spite
> of this, I'm going to respond on the fly with some thoughts in this vein.
>
>  We use Mahout at Twitter (the LDA stuff recently put in, and
> mahout-collections
> in various places, among other things) in production, and we use it,
> actually,
> via command-line invocations of the $MAHOUT_HOME/bin/mahout shell
> script.  It's invoked in an environment where we keep all of the parameters
> passed in in various (revision controlled) config files and the inputs are
> produced
> from a series of Pig jobs which are invoked in similar ways, and the
> outputs on HDFS are loaded by various and sundry processes in their own
> ways.
>
>  So in general, I totally agree with you that having production *java*
> apps call
> into main() methods of other classes is extremely ugly and error-prone.
> So
> how would it look to interact via a nice java API to a system which was
> going
> to launch some (possibly iterative series of) MapReduce jobs?
>
>  I guess I can see how this would go: DistributedLanczosSolver, for example
> can be run without the main() method:
>
> public int run(Path inputPath,
>                 Path outputPath,
>                 Path outputTmpPath,
>                 Path workingDirPath,
>                 int numRows,
>                 int numCols,
>                 boolean isSymmetric,
>                 int desiredRank)
>
> is something you could run right after instantiating a
> DistributedLanczosSolver and
> .setConf()'ing it.
>
> So is that the kind of thing we'd want more of?  Or are you thinking of
> something
> nicer, where instead of just a response code, you get handles on java
> objects which
> are pointing to the output data sets in some way?  I suppose it's not
> terribly hard
> to just do
>
>  DistributedRowMatrix outputData =
>     new DRM(outputPath, myTmpPath, numRows, numCols);
>
> after running another job, but maybe it would be even nicer to return a
> struct-like
> thing which has all the relevant output data as java objects.
>
> Another thing would be making sure that running these classes didn't
> require
> such long method argument lists - builders to the rescue!
>
>  -jake
>
>
> On Mon, Feb 13, 2012 at 9:31 AM, John Conwell <jo...@iamjohn.me> wrote:
>
> > From my perspective, I'd really like to see the Mahout API migrate
> > away from the command-line-centric design it currently utilizes, and
> > migrate more towards a library-centric API design.  I think this would
> > go a long way in getting Mahout adopted into real-life commercial
> > applications.
> >
> > While there might be a few algorithm drivers that you interact with by
> > creating an instance of a class and calling some method(s) on the
> > instance to interact with it (I haven't actually seen one like that,
> > but there might be a few), many algorithms are invoked by calling some
> > static function on a class that takes ~37 typed arguments.  But what's
> > worse, many drivers are invoked by having to create a String array
> > with ~37 arguments as string values, and calling the static main
> > function on the class.
> >
> > Now I'm not saying that having a static main function available to
> > invoke an algorithm from the command line isn't useful.  It is, when
> > you're testing an algorithm.  But once you want to integrate the
> > algorithm into a commercial workflow, it kind of sucks.
> >
> > For example, imagine if the API for invoking Math.max was designed the
> > way many of the Mahout algorithms currently are?  You'd have something
> > like this:
> >
> > String[] args = new String[2];
> > args[0] = "max";
> > args[1] = "7";
> > args[0] = "4";
> > int max = Math.main(args);
> >
> > It makes your code a horrible mess and very hard to maintain, as well as
> > very prone to bugs.
> >
> > When I see a bunch of static main functions as the only way to interact
> > with a library, no matter what the quality of the library is, my initial
> > impression is that this has to be some minimally supported effort by a
> few
> > PhD candidates still in academia, who will drop the project as soon as
> they
> > graduate.  And while this might not be the case, it is one of the first
> > impressions it gives, and can lead a company to drop the library from
> > consideration before they do any due diligence into its quality and
> > utility.
> >
> > I think as Mahout matures and gets closer to a 1.0 release, this kind of
> > API re-design will become more and more necessary, especially if you
> want a
> > higher Mahout integration rate into commercial applications and
> workflows.
> >
> > Also, I hope I don't sound too negative.  I'm very impressed with
> > Mahout and its capabilities.  I really like that there is a
> > well-thought-out class library of primitives for designing new serial
> > and distributed machine learning algorithms.  And I think it has high
> > utility for integration into highly visible commercial projects.  But
> > its high-level public API really is a barrier to entry when trying to
> > design commercial applications.
> >
> >
> > On Sun, Feb 12, 2012 at 12:20 AM, Jeff Eastman
> > <jd...@windwardsolutions.com>wrote:
> >
> > > We have a couple JIRAs that relate here: We want to factor all the
> (-cl)
> > > classification steps out of all of the driver classes (MAHOUT-930) and
> > into
> > > a separate job to remove duplicated code; MAHOUT-931 is to add a
> > pluggable
> > > outlier removal capability to this job; and MAHOUT-933 is aimed at
> > > factoring all the iteration mechanics from each driver class into the
> > > ClusterIterator, which uses a ClusterClassifier which is itself an
> > > OnlineLearner. This will hopefully allow semi-supervised classifier
> > > applications to be constructed by feeding cluster-derived models into
> the
> > > classification process. Still kind of fuzzy at this point but promising
> > too.
> > >
> > > On 2/11/12 2:29 PM, Frank Scholten wrote:
> > >
> > >> ...
> > >>
> > >> What kind of clustering refactoring do mean here? I did some work on
> > >> creating bean configurations in the past (MAHOUT-612). I
> underestimated
> > the
> > >> amount of work required to do the entire refactoring. If this can be
> > >> contributed and committed on a per-job basis I would like to help out.
> > >>
> > >>> ...
> > >>>
> > >>
> > >>
> > >
> >
> >
> > --
> >
> > Thanks,
> > John C
> >
>



-- 

Thanks,
John C

Re: Goals for Mahout 0.7

Posted by Jake Mannix <ja...@gmail.com>.
Hi John,

  This is some very good feedback, and warrants serious discussion.  In
spite
of this, I'm going to respond on the fly with some thoughts in this vein.

  We use Mahout at Twitter (the LDA stuff recently put in, and
mahout-collections
in various places, among other things) in production, and we use it,
actually,
via command-line invocations of the $MAHOUT_HOME/bin/mahout shell
script.  It's invoked in an environment where we keep all of the parameters
passed in in various (revision controlled) config files and the inputs are
produced
from a series of Pig jobs which are invoked in similar ways, and the outputs on
HDFS are loaded by various and sundry processes in their own ways.

  So in general, I totally agree with you that having production *java*
apps call
into main() methods of other classes is extremely ugly and error-prone.   So
how would it look to interact via a nice java API to a system which was
going
to launch some (possibly iterative series of) MapReduce jobs?

  I guess I can see how this would go: DistributedLanczosSolver, for example
can be run without the main() method:

public int run(Path inputPath,
                 Path outputPath,
                 Path outputTmpPath,
                 Path workingDirPath,
                 int numRows,
                 int numCols,
                 boolean isSymmetric,
                 int desiredRank)

is something you could run right after instantiating a
DistributedLanczosSolver and
.setConf()'ing it.

So is that the kind of thing we'd want more of?  Or are you thinking of
something
nicer, where instead of just a response code, you get handles on java
objects which
are pointing to the output data sets in some way?  I suppose it's not
terribly hard
to just do

  DistributedRowMatrix outputData =
     new DRM(outputPath, myTmpPath, numRows, numCols);

after running another job, but maybe it would be even nicer to return a
struct-like
thing which has all the relevant output data as java objects.

Another thing would be making sure that running these classes didn't require
such long method argument lists - builders to the rescue!

  -jake


On Mon, Feb 13, 2012 at 9:31 AM, John Conwell <jo...@iamjohn.me> wrote:

> ...

Re: Goals for Mahout 0.7

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
+1 I think this is an excellent goal. The current code base does not
approach its Java APIs in a uniform manner, nor are we where we had hoped
to be on CLI API uniformity. There's a lot to do here in both areas.

In the Java API area, we do have some notable successes, with the 
recommender APIs truly being designed for this kind of invocation. In 
the clustering drivers, we have tried to support native Java access as 
well, though there are a lot of arguments required for most invocations. 
Other drivers have really only been written for CLI access as you note 
and some large amounts of rather simple refactoring would be required to 
present a usable Java API.

The challenge here is that the Java API must account for all of the 
optional CLI arguments of every algorithm. This either leads to ~37 
typed arguments (hyperbole) or a set of helper methods which provide 
useful defaults for use in common situations. Another approach is to 
implement configuration beans which contain all the argument values 
required for full specification.
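The configuration-bean idea might look like the following sketch; the class and parameter names here (ClusteringConfig, convergenceDelta, runClustering) are purely illustrative, not existing Mahout classes:

```java
/**
 * Sketch of a configuration bean: all driver arguments collected in one
 * object with sensible defaults, built fluently instead of being passed
 * as ~37 positional parameters.
 */
public class ClusteringConfig {
    private final int maxIterations;
    private final double convergenceDelta;
    private final boolean runClustering;

    private ClusteringConfig(Builder b) {
        this.maxIterations = b.maxIterations;
        this.convergenceDelta = b.convergenceDelta;
        this.runClustering = b.runClustering;
    }
    public int maxIterations() { return maxIterations; }
    public double convergenceDelta() { return convergenceDelta; }
    public boolean runClustering() { return runClustering; }

    public static class Builder {
        // Useful defaults cover the common cases.
        private int maxIterations = 10;
        private double convergenceDelta = 0.001;
        private boolean runClustering = true;

        public Builder maxIterations(int n) { this.maxIterations = n; return this; }
        public Builder convergenceDelta(double d) { this.convergenceDelta = d; return this; }
        public Builder runClustering(boolean b) { this.runClustering = b; return this; }
        public ClusteringConfig build() { return new ClusteringConfig(this); }
    }

    public static void main(String[] args) {
        // Override only what differs from the defaults.
        ClusteringConfig config = new ClusteringConfig.Builder()
            .maxIterations(20)
            .build();
        System.out.println(config.maxIterations() + " " + config.convergenceDelta());
    }
}
```

A bean like this also gives the CLI and the Java API one shared place to define defaults, rather than duplicating them in each entry point.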

In the current clustering refactoring under way to utilize the
ClusterClassifier, arguments are to be provided in ClusteringPolicy
objects, so I'm biased towards the latter approach. We ought to agree on
which style we want to take this goal forward, but I am 100% behind it.

Jeff


On 2/13/12 10:31 AM, John Conwell wrote:
> ...


Re: Goals for Mahout 0.7

Posted by Paritosh Ranjan <pr...@xebia.com>.
Recently I gave a presentation on Mahout's capabilities to the sales
team of my company. For that I had to prepare some demo apps. I think
you would agree that the Java API would be a good choice for developing
something like that quickly.

Though I was able to create recommendation and clustering demos using
the Java APIs very quickly, I was not able to do the same for
classification. So I think the Java APIs need some attention.

If any statistics can be found regarding how Mahout usage is preferred
(command line or API), they could also help in prioritizing development.

On 13-02-2012 23:22, Manuel Blechschmidt wrote:
> ...

Re: Goals for Mahout 0.7

Posted by Manuel Blechschmidt <Ma...@gmx.de>.
Hi John, hi All,
we integrated big parts of the recommendation algorithms in Mahout in our commercial product. One quite important thing that we needed were middleware compliant interfaces. Meaning that it was possible to serialize input datasets and serialize output recommendations.

Meaning we are offering the following interfaces for our recommendation based on Mahout:

Output:
- Webservices
- REST
- RMI
- HTTP
- Mail
- JMS

Input:
- WebServices
- RMI
- JDBC
- File
- JMS

I think it could be a big benefit for mahout if it would offer a consistent API across these different technologies. The taste-web application is a good step in this direction.

Further I think that a user interface would help mahout to get bigger acceptance in the user community.

Hope that helps
/Manuel
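Manuel's point about serializable inputs and outputs could be sketched as a single transport-neutral form that any of the listed channels (REST, JMS, mail, file) could carry. The Item class and the colon-separated format below are hypothetical stand-ins, not Taste's actual classes or wire format:

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of a transport-neutral text form for recommendations. */
public class RecommendationPayload {

    /** Minimal stand-in for a recommended (itemId, score) pair. */
    public static final class Item {
        public final long itemId;
        public final float value;
        public Item(long itemId, float value) {
            this.itemId = itemId;
            this.value = value;
        }
    }

    /** Serialize recommendations once; send the string over any transport. */
    public static String serialize(List<Item> items) {
        StringBuilder sb = new StringBuilder();
        for (Item item : items) {
            if (sb.length() > 0) sb.append(',');
            sb.append(item.itemId).append(':').append(item.value);
        }
        return sb.toString();
    }

    /** Parse the same form back on the consuming side. */
    public static List<Item> deserialize(String payload) {
        List<Item> items = new ArrayList<Item>();
        for (String part : payload.split(",")) {
            String[] kv = part.split(":");
            items.add(new Item(Long.parseLong(kv[0]), Float.parseFloat(kv[1])));
        }
        return items;
    }

    public static void main(String[] args) {
        List<Item> recs = new ArrayList<Item>();
        recs.add(new Item(42L, 4.5f));
        recs.add(new Item(7L, 3.0f));
        System.out.println(serialize(recs)); // 42:4.5,7:3.0
    }
}
```

Once the payload is a plain string, offering it over webservices, REST, JMS, or a file is just a matter of plugging in the transport; the serialization layer stays consistent across all of them.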

On 13.02.2012, at 18:31, John Conwell wrote:

> ...

-- 
Manuel Blechschmidt
Dortustr. 57
14467 Potsdam
Mobil: 0173/6322621
Twitter: http://twitter.com/Manuel_B


Re: Goals for Mahout 0.7

Posted by John Conwell <jo...@iamjohn.me>.
Man...I totally borked my Math.max example.  But I'm sure you get the
picture.

On Mon, Feb 13, 2012 at 9:31 AM, John Conwell <jo...@iamjohn.me> wrote:

> ...


-- 

Thanks,
John C

Re: Goals for Mahout 0.7

Posted by Grant Ingersoll <gs...@apache.org>.
One of our top goals, in my mind, has to be speeding up our tests!  I only wish I knew how, given that basic attempts at parallelism with Maven have failed miserably.



On Feb 14, 2012, at 3:29 PM, Jeff Eastman wrote:

> +users@
> 
> Just to be clear, I'm not advocating replacing the JIRA process with a new set of green-field goals. Rather, IMHO, having a small number of overarching goals for a release *could* help us focus our efforts (triage our feature JIRAs) and *might* suggest some missing JIRAs that would give that release more completeness, usability and "sizzle" in those few areas. Hopefully more completeness and usability and sizzle than we might otherwise obtain using a scattered, bottom-up approach.
> 
> It's the sort of release planning and priority setting I've observed product managers doing in my many past lives. Of course, fixing defects has a higher priority than adding new features, but giving each release some focus and coherence is a mark of a mature product program. An 80% solution in three areas is not as good as a 100% solution in one. At HP, we used to say "Do a few things well". We've been saying "Well, let's do a few more things" too long.
> 
> On 2/14/12 12:25 PM, Sean Owen wrote:
>> When 0.6 was released, there was an all-time record of open JIRAs --
>> something like 90-100 (I closed maybe 10 quickly.) It's just math:
>> there is a certain level of interest and rate of new requests and
>> issues. There is some level of committer time and energy available to
>> work on them. The former is just getting larger and the latter is
>> shrinking. Neither of these things are the problem per se, and neither
>> is something to be fixed; you can't ask people to not have ideas or
>> issues, and you can't tell people they should be contributing more
>> here.
>> 
>> But I do think it means that it's more urgent than ever to have some
>> strategy to tackle the JIRA, rather than talk about more green-field
>> plans. This has been discussed before, and there were ideas like new
>> JIRA tags, but I don't think it's been more than some labeling of the
>> problem. There haven't been new committers, and JIRA rot is
>> discouraging new ones, which makes it worse.
>> 
>> JIRA is really a symptom; there is just a lot of sprawl and cruft to
>> the project that's not being talked about or addressed.
>> 
>> I can't say don't write down any new plans in JIRA. I can only point
>> out what's happened many times: big ideas go half implemented if at
>> all. Writing them down isn't really useful work. Meanwhile, I can see
>> ten JIRAs from new contributors that have been ignored, and, many new
>> bug reports are avoidable, just symptoms of scattered un-unified code
>> that was never refined. It won't be different if this cycle is
>> repeated. It's not going to kill this project but it's not going to
>> get out of AAA to the Major Leagues at this rate, and that is
>> frustrating.
>> 
>> Fortunately, I think this remains pretty solvable. More work on
>> existing issues sure helps, but nobody can count on that. It's then a
>> question of scope: narrowing scope to something maintainable, making
>> that scope clear, turning down JIRAs that don't fit, focusing
>> attention on actionable JIRAs that do. Yes, you have to be able to
>> not-do things in a project as well as do things, even in open source.
>> 
>> I think that scope is still large at "maintaining what exists already,
>> and fixing it up". Since I think this is the only realistic approach
>> to a next version, in this conversation I could not support any
>> approach that pretends to do five more things in the next version --
>> at least not unless accompanied by some plan to address the
>> contributions already in line in JIRA. It's not OK to be implicitly
>> rejecting so much from the community by not planning to fix that first
>> and foremost.
>> 
>> 
> 

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com





Re: Goals for Mahout 0.7

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
+users@

Just to be clear, I'm not advocating replacing the JIRA process with a 
new set of green-field goals. Rather, IMHO, having a small number of 
overarching goals for a release *could* help us focus our efforts 
(triage our feature JIRAs) and *might* suggest some missing JIRAs that 
would give that release more completeness, usability and "sizzle" in 
those few areas. Hopefully more completeness and usability and sizzle 
than we might otherwise obtain using a scattered, bottom-up approach.

It's the sort of release planning and priority setting I've observed 
product managers doing in my many past lives. Of course, fixing defects 
has a higher priority than adding new features, but giving each release 
some focus and coherence is a mark of a mature product program. An 80% 
solution in three areas is not as good as a 100% solution in one. At HP, 
we used to say "Do a few things well". We've been saying "Well, let's do 
a few more things" too long.

On 2/14/12 12:25 PM, Sean Owen wrote:
> When 0.6 was released, there was an all-time record of open JIRAs --
> something like 90-100 (I closed maybe 10 quickly.) It's just math:
> there is a certain level of interest and rate of new requests and
> issues. There is some level of committer time and energy available to
> work on them. The former is just getting larger and the latter is
> shrinking. Neither of these things are the problem per se, and neither
> is something to be fixed; you can't ask people to not have ideas or
> issues, and you can't tell people they should be contributing more
> here.
>
> But I do think it means that it's more urgent than ever to have some
> strategy to tackle the JIRA, rather than talk about more green-field
> plans. This has been discussed before, and there were ideas like new
> JIRA tags, but I don't think it's been more than some labeling of the
> problem. There haven't been new committers, and JIRA rot is
> discouraging new ones, which makes it worse.
>
> JIRA is really a symptom; there is just a lot of sprawl and cruft to
> the project that's not being talked about or addressed.
>
> I can't say don't write down any new plans in JIRA. I can only point
> out what's happened many times: big ideas go half implemented if at
> all. Writing them down isn't really useful work. Meanwhile, I can see
> ten JIRAs from new contributors that have been ignored, and, many new
> bug reports are avoidable, just symptoms of scattered un-unified code
> that was never refined. It won't be different if this cycle is
> repeated. It's not going to kill this project but it's not going to
> get out of AAA to the Major Leagues at this rate, and that is
> frustrating.
>
> Fortunately, I think this remains pretty solvable. More work on
> existing issues sure helps, but nobody can count on that. It's then a
> question of scope: narrowing scope to something maintainable, making
> that scope clear, turning down JIRAs that don't fit, focusing
> attention on actionable JIRAs that do. Yes, you have to be able to
> not-do things in a project as well as do things, even in open source.
>
> I think that scope is still large at "maintaining what exists already,
> and fixing it up". Since I think this is the only realistic approach
> to a next version, in this conversation I could not support any
> approach that pretends to do five more things in the next version --
> at least not unless accompanied by some plan to address the
> contributions already in line in JIRA. It's not OK to be implicitly
> rejecting so much from the community by not planning to fix that first
> and foremost.
>
>



Re: Goals for Mahout 0.7

Posted by Sean Owen <sr...@gmail.com>.
When 0.6 was released, there was an all-time record of open JIRAs --
something like 90-100 (I closed maybe 10 quickly.) It's just math:
there is a certain level of interest and rate of new requests and
issues. There is some level of committer time and energy available to
work on them. The former is just getting larger and the latter is
shrinking. Neither of these things are the problem per se, and neither
is something to be fixed; you can't ask people to not have ideas or
issues, and you can't tell people they should be contributing more
here.

But I do think it means that it's more urgent than ever to have some
strategy to tackle the JIRA, rather than talk about more green-field
plans. This has been discussed before, and there were ideas like new
JIRA tags, but I don't think it's been more than some labeling of the
problem. There haven't been new committers, and JIRA rot is
discouraging new ones, which makes it worse.

JIRA is really a symptom; there is just a lot of sprawl and cruft to
the project that's not being talked about or addressed.

I can't say don't write down any new plans in JIRA. I can only point
out what's happened many times: big ideas go half implemented if at
all. Writing them down isn't really useful work. Meanwhile, I can see
ten JIRAs from new contributors that have been ignored, and, many new
bug reports are avoidable, just symptoms of scattered un-unified code
that was never refined. It won't be different if this cycle is
repeated. It's not going to kill this project but it's not going to
get out of AAA to the Major Leagues at this rate, and that is
frustrating.

Fortunately, I think this remains pretty solvable. More work on
existing issues sure helps, but nobody can count on that. It's then a
question of scope: narrowing scope to something maintainable, making
that scope clear, turning down JIRAs that don't fit, focusing
attention on actionable JIRAs that do. Yes, you have to be able to
not-do things in a project as well as do things, even in open source.

I think that scope is still large at "maintaining what exists already,
and fixing it up". Since I think this is the only realistic approach
to a next version, in this conversation I could not support any
approach that pretends to do five more things in the next version --
at least not unless accompanied by some plan to address the
contributions already in line in JIRA. It's not OK to be implicitly
rejecting so much from the community by not planning to fix that first
and foremost.

Re: Goals for Mahout 0.7

Posted by John Conwell <jo...@iamjohn.me>.
From my perspective, I'd really like to see the Mahout API migrate away
from the command-line-centric design it currently utilizes, and move more
towards a library-centric API design.  I think this would go a long way in
getting Mahout adopted into real-life commercial applications.

While there might be a few algorithm drivers that you interact with by
creating an instance of a class and calling some method(s) on the instance
(I haven't actually seen one like that, but there might
be a few), many algorithms are invoked by calling some static function on a
class that takes ~37 typed arguments.  But what's worse, many drivers are
invoked by having to create a String array with ~37 arguments as string
values, and calling the static main function on the class.

Now I'm not saying that having a static main function available to invoke
an algorithm from the command line isn't useful.  It is, when you're testing
an algorithm.  But once you want to integrate the algorithm into a
commercial workflow it kind of sucks.

For example, imagine if the API for invoking Math.max were designed the way
many of the Mahout algorithms currently are?  You'd have something like
this:

String[] args = new String[3];
args[0] = "max";
args[1] = "7";
args[2] = "4";
int max = Math.main(args);

It makes your code a horrible mess and very hard to maintain, as well as
very prone to bugs.

When I see a bunch of static main functions as the only way to interact
with a library, no matter what the quality of the library is, my initial
impression is that this has to be some minimally supported effort by a few
PhD candidates still in academia, who will drop the project as soon as they
graduate.  And while this might not be the case, it is one of the first
impressions it gives, and can lead a company to drop the library from
consideration before they do any due diligence into its quality and utility.

I think as Mahout matures and gets closer to a 1.0 release, this kind of
API re-design will become more and more necessary, especially if you want a
higher Mahout integration rate into commercial applications and workflows.

Also, I hope I don't sound too negative.  I'm very impressed with Mahout and
its capabilities.  I really like that there is a well thought out class
library of primitives for designing new serial and distributed machine
learning algorithms.  And I think it has a high utility for integration
into highly visible commercial projects.  But its high level public API
really is a barrier to entry when trying to design commercial applications.


On Sun, Feb 12, 2012 at 12:20 AM, Jeff Eastman
<jd...@windwardsolutions.com>wrote:

> We have a couple JIRAs that relate here: We want to factor all the (-cl)
> classification steps out of all of the driver classes (MAHOUT-930) and into
> a separate job to remove duplicated code; MAHOUT-931 is to add a pluggable
> outlier removal capability to this job; and MAHOUT-933 is aimed at
> factoring all the iteration mechanics from each driver class into the
> ClusterIterator, which uses a ClusterClassifier which is itself an
> OnlineLearner. This will hopefully allow semi-supervised classifier
> applications to be constructed by feeding cluster-derived models into the
> classification process. Still kind of fuzzy at this point but promising too.
>
> On 2/11/12 2:29 PM, Frank Scholten wrote:
>
>> ...
>>
>> What kind of clustering refactoring do you mean here? I did some work on
>> creating bean configurations in the past (MAHOUT-612). I underestimated the
>> amount of work required to do the entire refactoring. If this can be
>> contributed and committed on a per-job basis I would like to help out.
>>
>>> ...
>>>
>>
>>
>


-- 

Thanks,
John C


Re: Goals for Mahout 0.7

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
We have a couple JIRAs that relate here: We want to factor all the (-cl) 
classification steps out of all of the driver classes (MAHOUT-930) and 
into a separate job to remove duplicated code; MAHOUT-931 is to add a 
pluggable outlier removal capability to this job; and MAHOUT-933 is 
aimed at factoring all the iteration mechanics from each driver class 
into the ClusterIterator, which uses a ClusterClassifier which is itself 
an OnlineLearner. This will hopefully allow semi-supervised classifier 
applications to be constructed by feeding cluster-derived models into 
the classification process. Still kind of fuzzy at this point but 
promising too.
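To make the shape of that design concrete, here is a toy, self-contained sketch. The interface and class names deliberately echo the ones above, but these are NOT the actual Mahout classes or signatures; it only illustrates the idea of a classifier whose cluster model is learned online by an iterator-style driver.

```java
public class ClusterSketch {
    interface OnlineLearner {
        void train(double[] instance);  // incremental model update
    }

    // Toy stand-in for the ClusterClassifier idea: it learns a cluster
    // model online (here, a running 1-D centroid) and can then classify.
    static final class ClusterClassifier implements OnlineLearner {
        private double centroid;
        private int n;

        @Override
        public void train(double[] x) {
            n++;
            centroid += (x[0] - centroid) / n;  // online mean update
        }

        int classify(double[] x) {
            return x[0] >= centroid ? 1 : 0;    // split on the learned centroid
        }

        double centroid() { return centroid; }
    }

    // ClusterIterator-style driver: each pass feeds every point to the
    // classifier's online-learning side.
    static ClusterClassifier iterate(double[][] data, int passes) {
        ClusterClassifier c = new ClusterClassifier();
        for (int i = 0; i < passes; i++) {
            for (double[] point : data) {
                c.train(point);
            }
        }
        return c;
    }

    public static void main(String[] args) {
        ClusterClassifier c = iterate(new double[][] {{1}, {3}}, 1);
        System.out.println(c.centroid());
        System.out.println(c.classify(new double[] {5}));
    }
}
```

The semi-supervised angle is that the model produced by the clustering pass is handed directly to the classification side, rather than being rebuilt from labeled data.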

On 2/11/12 2:29 PM, Frank Scholten wrote:
> ...
> What kind of clustering refactoring do you mean here? I did some work on 
> creating bean configurations in the past (MAHOUT-612). I 
> underestimated the amount of work required to do the entire 
> refactoring. If this can be contributed and committed on a per-job 
> basis I would like to help out.
>> ...
>


Re: Goals for Mahout 0.7

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
+ users@

These are great ideas, and are just the kinds of high level 
conversations I was hoping to engender. From my agile background, I'd 
hope to define 0.7 by a small number of "epic stories", in a subset of 
our overall capabilities, which could focus our attention to a set of 
derivative JIRAs  that will give Mahout a quantum step forward in some 
functional area from our user's perspective. I think maybe 2-3 such 
"epics" are all we can handle in a release. I don't necessarily think 
mine are the right ones either, but they should prime the pump.

If we could only do 2-3 epics, what would they be? Where would the 
biggest contributions lie?

On 2/11/12 9:45 PM, Lance Norskog wrote:
> For incremental improvements, usability and correctness of algorithms.
> The "new" Naive Bayes and SGD algorithms both seem to have trouble
> classifying. Also, interpretation of results. It is hard to summarize
> the quality of results. I often feel like the math-savvy implementors
> print a bunch of numbers and say "that looks right", and the rest of
> us struggle to get an intuition of what's going on and why.
>
> For new features, "Mahout Online" would be great: a web service that
> packages all of the "online" algorithms (tractable speed and memory
> use).
>
> On Sat, Feb 11, 2012 at 1:29 PM, Frank Scholten<fr...@frankscholten.nl>  wrote:
>> I'd like to add solving ClassNotFoundException problems with third
>> party jars in some jobs.
>>
>> I experimented with having seq2sparse uploading a third party jar with
>> analyzer and add it to the DistributedCache. Uploading works but
>> didn't yet get it working inside the Mappers. I have some code lying
>> around for this that can be used as a starting point, including a
>> separate project that has dependencies on Mahout and on an analyzer to
>> test things out.
>>
>> Another thing would be adding or improving the integration tools. For
>> example adding a mysql2seq to cluster text from a SQL database.
>>
>> On Sat, Feb 11, 2012 at 8:01 PM, Jeff Eastman
>> <jd...@windwardsolutions.com>  wrote:
>>> Now that 0.6 is in the box, it seems a good time to start thinking about
>>> 0.7, from a high level goal perspective at least. Here are a couple that
>>> come to mind:
>>>
>>> Target code freeze date August 1, 2012
>>> Get Jenkins working for us again
>>> Complete clustering refactoring and classification convergence
>> What kind of clustering refactoring do you mean here? I did some work on
>> creating bean configurations in the past (MAHOUT-612). I
>> underestimated the amount of work required to do the entire
>> refactoring. If this can be contributed and committed on a per-job
>> basis I would like to help out.
>>
>>> ...
>
>



Re: Goals for Mahout 0.7

Posted by Lance Norskog <go...@gmail.com>.
For incremental improvements, usability and correctness of algorithms.
The "new" Naive Bayes and SGD algorithms both seem to have trouble
classifying. Also, interpretation of results. It is hard to summarize
the quality of results. I often feel like the math-savvy implementors
print a bunch of numbers and say "that looks right", and the rest of
us struggle to get an intuition of what's going on and why.

For new features, "Mahout Online" would be great: a web service that
packages all of the "online" algorithms (tractable speed and memory
use).

On Sat, Feb 11, 2012 at 1:29 PM, Frank Scholten <fr...@frankscholten.nl> wrote:
> I'd like to add solving ClassNotFoundException problems with third
> party jars in some jobs.
>
> I experimented with having seq2sparse upload a third-party jar with an
> analyzer and add it to the DistributedCache. Uploading works, but I
> didn't yet get it working inside the Mappers. I have some code lying
> around for this that can be used as a starting point, including a
> separate project that has dependencies on Mahout and on an analyzer to
> test things out.
>
> Another thing would be adding or improving the integration tools. For
> example adding a mysql2seq to cluster text from a SQL database.
>
> On Sat, Feb 11, 2012 at 8:01 PM, Jeff Eastman
> <jd...@windwardsolutions.com> wrote:
>> Now that 0.6 is in the box, it seems a good time to start thinking about
>> 0.7, from a high level goal perspective at least. Here are a couple that
>> come to mind:
>>
>> Target code freeze date August 1, 2012
>> Get Jenkins working for us again
>> Complete clustering refactoring and classification convergence
>
> What kind of clustering refactoring do you mean here? I did some work on
> creating bean configurations in the past (MAHOUT-612). I
> underestimated the amount of work required to do the entire
> refactoring. If this can be contributed and committed on a per-job
> basis I would like to help out.
>
>> ...



-- 
Lance Norskog
goksron@gmail.com

Re: Goals for Mahout 0.7

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
We have a couple of JIRAs that relate here: we want to factor all the (-cl) 
classification steps out of all of the driver classes (MAHOUT-930) and 
into a separate job to remove duplicated code; MAHOUT-931 is to add a 
pluggable outlier removal capability to this job; and MAHOUT-933 is 
aimed at factoring all the iteration mechanics from each driver class 
into the ClusterIterator, which uses a ClusterClassifier which is itself 
an OnlineLearner. This will hopefully allow semi-supervised classifier 
applications to be constructed by feeding cluster-derived models into 
the classification process. Still kind of fuzzy at this point but 
promising too.
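
To make the semi-supervised idea above concrete, here is a rough sketch in plain Java. The names are illustrative only, not the actual ClusterIterator/ClusterClassifier API from MAHOUT-933: centroids produced by a clustering run seed a nearest-centroid classifier, which can then be trained online in the OnlineLearner style.

```java
import java.util.Arrays;

// Sketch only: a cluster-derived model reused as a classifier with an
// online train() step. Class and method names are placeholders.
public class CentroidClassifier {
    private final double[][] centroids;   // one centroid per cluster/category
    private final int[] counts;           // observations folded into each centroid

    CentroidClassifier(double[][] seedCentroids) {
        centroids = seedCentroids;
        counts = new int[seedCentroids.length];
        Arrays.fill(counts, 1);           // count each seed centroid as one point
    }

    // Classify by nearest centroid (squared Euclidean distance).
    int classify(double[] x) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0;
            for (int i = 0; i < x.length; i++) {
                double diff = x[i] - centroids[c][i];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }

    // Online update: fold a labeled point into its category's running mean.
    void train(int label, double[] x) {
        counts[label]++;
        for (int i = 0; i < x.length; i++) {
            centroids[label][i] += (x[i] - centroids[label][i]) / counts[label];
        }
    }

    public static void main(String[] args) {
        double[][] seeds = { {0, 0}, {10, 10} };   // e.g. k-means output
        CentroidClassifier cc = new CentroidClassifier(seeds);
        System.out.println(cc.classify(new double[] {1, 1}));   // prints 0
        System.out.println(cc.classify(new double[] {9, 9}));   // prints 1
        cc.train(0, new double[] {2, 2});   // refine cluster 0 with a labeled point
    }
}
```

The point of the sketch is only the shape: clustering supplies the initial model, classification consumes it, and the same object accepts incremental training, which is what would let cluster-derived models feed a semi-supervised classification pipeline.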

On 2/11/12 2:29 PM, Frank Scholten wrote:
> ...
> What kind of clustering refactoring do you mean here? I did some work on 
> creating bean configurations in the past (MAHOUT-612). I 
> underestimated the amount of work required to do the entire 
> refactoring. If this can be contributed and committed on a per-job 
> basis I would like to help out.
>> ...
>


Re: Goals for Mahout 0.7

Posted by Frank Scholten <fr...@frankscholten.nl>.
I'd like to add solving ClassNotFoundException problems with third
party jars in some jobs.

I experimented with having seq2sparse upload a third-party jar with an
analyzer and add it to the DistributedCache. Uploading works, but I
didn't yet get it working inside the Mappers. I have some code lying
around for this that can be used as a starting point, including a
separate project that has dependencies on Mahout and on an analyzer to
test things out.
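
As a side note, the client half of this is easy to check up front. The sketch below (the analyzer class name is made up) verifies that a third-party class is visible before a job is submitted, so the failure surfaces immediately instead of as a ClassNotFoundException deep inside a mapper; actually shipping the jar to the cluster (e.g. via DistributedCache.addFileToClassPath) is the separate step that still needs solving.

```java
// Sketch: fail fast on the client if a third-party analyzer class is not
// on the classpath. "com.example.MyAnalyzer" is a placeholder name.
public class AnalyzerPreflight {

    static boolean isLoadable(String className) {
        try {
            // initialize=false: resolve the class without running static init
            Class.forName(className, false,
                Thread.currentThread().getContextClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A JDK class stands in for an analyzer that is on the classpath:
        System.out.println(isLoadable("java.util.ArrayList"));    // prints true
        // A missing third-party analyzer reproduces the CNFE symptom:
        System.out.println(isLoadable("com.example.MyAnalyzer")); // prints false
    }
}
```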

Another thing would be adding or improving the integration tools. For
example adding a mysql2seq to cluster text from a SQL database.
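
The core of such a mysql2seq tool might look like the sketch below (the tool doesn't exist yet; names are illustrative). A real version would read rows with JDBC and write them with Hadoop's SequenceFile.Writer using Text keys and values; both are replaced here with plain Java types to keep the sketch self-contained.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: map (id, text) rows from a SQL table onto the key/value pairs a
// SequenceFile would hold, ready for seq2sparse downstream.
public class Mysql2SeqSketch {

    // Key = "/<table>/<id>", mirroring how seqdirectory keys documents by
    // path; value = the document text to be vectorized later.
    static Map<String, String> toSeqRecords(String table, List<String[]> rows) {
        Map<String, String> records = new LinkedHashMap<>();
        for (String[] row : rows) {
            records.put("/" + table + "/" + row[0], row[1]);
        }
        return records;
    }

    public static void main(String[] args) {
        // In the real tool these rows would come from a JDBC ResultSet.
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"1", "first document body"});
        rows.add(new String[] {"2", "second document body"});
        System.out.println(toSeqRecords("articles", rows));
    }
}
```

Keeping the key scheme path-like means the output slots into the existing seq2sparse / clustering pipeline exactly as seqdirectory output does.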

On Sat, Feb 11, 2012 at 8:01 PM, Jeff Eastman
<jd...@windwardsolutions.com> wrote:
> Now that 0.6 is in the box, it seems a good time to start thinking about
> 0.7, from a high level goal perspective at least. Here are a couple that
> come to mind:
>
> Target code freeze date August 1, 2012
> Get Jenkins working for us again
> Complete clustering refactoring and classification convergence

What kind of clustering refactoring do you mean here? I did some work on
creating bean configurations in the past (MAHOUT-612). I
underestimated the amount of work required to do the entire
refactoring. If this can be contributed and committed on a per-job
basis I would like to help out.

> ...

Re: Goals for Mahout 0.7

Posted by Lance Norskog <go...@gmail.com>.
Yes! Connecting R and Mahout within the same JVM is an awesome idea.

Approaching Mahout as a non-mathematician user is frustrating because
of the difficulty in visualizing and tuning results. I've done some
hacky things with KNime and Excel, but the ability to do math-heavy
post-processing and visualization directly would be excellent.


On Tue, Feb 14, 2012 at 12:56 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> My company and I have allocated some time to create a mixed
> environment of R and other "stuff", and, in particular, Mahout. I am
> thinking of a "contributed" project with R where R is enabled to play
> the following roles:
>
> #1 Mahout's front end driver mixing Mahout computations and R vector/matrices
> #2 data vectorization/preparation routines loaded into backend of
> Mahout's abstract job and adapted to write DRM;
> #3 perhaps some routines allowing subsampling & subsequent
> visualization of Mahout results for prototyping and control purposes.
>
>
> #2 kind of comes close to what R-Hadoop project does with their
> mapreduce package but unfortunately it looks like that project focuses
> on a particular way of serialization of R objects and adaptation for
> DRM serialization doesn't seem plausible at this time. Besides, I am
> thinking that it's not so difficult to run R from inside a mapper
> (R-Hadoop uses streaming, but I think it's worth trying the R inverse
> java package instead of streaming and bypassing the whole text/parse
> routine completely).
>
> The lack of rapid prototyping and visualization of results is, I think,
> one of the bigger barriers to Mahout adoption. Enabling a mixed
> environment for cpu-laden computations in R would be a huge leap
> towards prototyping big data pipelines, IMO. Or at least it seems that
> way from the vantage point of the problems I am currently working on.
> Rapid prototyping of Mahout pipelines may be a huge help, esp. as new
> methods become available to try and validate.
>
> -d
>
> On Sat, Feb 11, 2012 at 11:01 AM, Jeff Eastman
> <jd...@windwardsolutions.com> wrote:
>> Now that 0.6 is in the box, it seems a good time to start thinking about
>> 0.7, from a high level goal perspective at least. Here are a couple that
>> come to mind:
>>
>> Target code freeze date August 1, 2012
>> Get Jenkins working for us again
>> Complete clustering refactoring and classification convergence
>> ...



-- 
Lance Norskog
goksron@gmail.com

Re: Goals for Mahout 0.7

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
My company and I have allocated some time to create a mixed
environment of R and other "stuff", and, in particular, Mahout. I am
thinking of a "contributed" project with R where R is enabled to play
the following roles:

#1 Mahout's front end driver mixing Mahout computations and R vector/matrices
#2 data vectorization/preparation routines loaded into backend of
Mahout's abstract job and adapted to write DRM;
#3 perhaps some routines allowing subsampling & subsequent
visualization of Mahout results for prototyping and control purposes.


#2 kind of comes close to what R-Hadoop project does with their
mapreduce package but unfortunately it looks like that project focuses
on a particular way of serialization of R objects and adaptation for
DRM serialization doesn't seem plausible at this time. Besides, I am
thinking that it's not so difficult to run R from inside a mapper
(R-Hadoop uses streaming, but I think it's worth trying the R inverse
java package instead of streaming and bypassing the whole text/parse
routine completely).

The lack of rapid prototyping and visualization of results is, I think,
one of the bigger barriers to Mahout adoption. Enabling a mixed
environment for cpu-laden computations in R would be a huge leap
towards prototyping big data pipelines, IMO. Or at least it seems that
way from the vantage point of the problems I am currently working on.
Rapid prototyping of Mahout pipelines may be a huge help, esp. as new
methods become available to try and validate.

-d

On Sat, Feb 11, 2012 at 11:01 AM, Jeff Eastman
<jd...@windwardsolutions.com> wrote:
> Now that 0.6 is in the box, it seems a good time to start thinking about
> 0.7, from a high level goal perspective at least. Here are a couple that
> come to mind:
>
> Target code freeze date August 1, 2012
> Get Jenkins working for us again
> Complete clustering refactoring and classification convergence
> ...