Posted to dev@mahout.apache.org by Sean Owen <sr...@gmail.com> on 2009/09/04 18:13:15 UTC

What's the plan for Mahout?

Guys, quick and broad question -- what's the roadmap for Mahout look
like? Even just for the next two releases?

Now, much of the project is mostly a space for tinkering, tossing
around bits of code for now, and that's OK for 0.1 or 0.2. I just
wonder what the path to a proper finished product is like. It'll take
some agreement on who exactly the audience is, what they need and
don't need, what interface it presents to those users. It takes work
to design for that, bring the project into line around that design,
document and test, etc. And -- it takes people with responsibility and
authority to make it happen.

I'm not clear we quite have those things yet. Until we do this will be
an 0.x project that nobody can really get into using for production.
It doesn't have to happen tomorrow, but, what's our path like from
here to there? Spare time from even 10 people won't get the docs
written, tidy the code, refactor / redesign / unify the lot of
copy/paste that's going on, etc. People definitely have ideas about
what the project should do -- I see lots of little bits of
functionality being thrown into the pot. But is it adding up to
something consistent and coherent? should we talk seriously about it?
"Machine learning" is too broad a remit.

It's not ruining my day or anything, but I'm sitting on a piece of the
project that I put effort into making clearly do a few things, do them
well, and not try to do other things -- designed for practical use
cases, documented, polished, and tested. So I'll be a little
concerned if it's attached to an early-0.x tinkering project this time
next year. That's not cool for an Apache project anyway.

It may be presumptuous but I volunteer to try to lead answers to these
questions. It's going to lead to some tough answers and more work in
some cases, no matter who drives it. Hoping to do it sooner than
later.

Re: What's the plan for Mahout?

Posted by Ted Dunning <te...@gmail.com>.
I think that Mahout is more along the lines of the hadoop gestation.  There
was initially one need for that (Nutch simplification) and it has grown
into quite something.  It is also only now verging on beginning to think
about what 1.0 should be.  That is fine by me.

On Fri, Sep 4, 2009 at 2:03 PM, Grant Ingersoll <gs...@apache.org> wrote:

>
> It took Lucene 6+ years to reach what I would call a really capable system.
>  The early stages were promising and worked for many, but
> it was not until 2004-05 that it really started taking off.  Not saying
> that Mahout will take that long, especially given how widely
> adopted/accepted Open Source is now as compared to the early days of Lucene,
> but it does take time.  That being said, we certainly need to get more
> people looking at more parts of the code and proposing and implementing
> improvements.




-- 
Ted Dunning, CTO
DeepDyve

Re: What's the plan for Mahout?

Posted by Grant Ingersoll <gs...@apache.org>.
First off, thanks for bringing this up!

On Sep 4, 2009, at 9:13 AM, Sean Owen wrote:

> Guys, quick and broad question -- what's the roadmap for Mahout look
> like? Even just for the next two releases?

I asked a little while back about this.  I think we can put 0.2
out after Robin and Deneche get their pieces in (Random Forests,
classification refactoring), which hopefully should be soon since they
are now committers.

We've cleaned up a lot and made a number of improvements in the code
since 0.1; it would be good to get them out to a broader audience.

After that, I don't think we particularly have to go through all the
0.x (0.3, 0.4, ...) integers on our way to 1.0.  The primary goal
before 1.0 is to make sure we are happy with the APIs before (to some
extent) "locking them down", but I'm not sure we need to be
that worried about locking down, since most of our code isn't public
API anyway and we need not necessarily worry about back
compatibility.  I think the other primary thing we need is to get some
larger scale testing in place.  I believe Amazon still has in place
its committers program such that committers can get access to EC2
credits for testing.  Let me know if anyone needs an account.


>
> Now, much of the project is mostly a space for tinkering, tossing
> around bits of code for now, and that's OK for 0.1 or 0.2. I just
> wonder what the path to a proper finished product is like. It'll take
> some agreement on who exactly the audience is, what they need and
> don't need, what interface it presents to those users. It takes work
> to design for that, bring the project into line around that design,
> document and test, etc. And -- it takes people with responsibility and
> authority to make it happen.

I think what we have now goes beyond tinkering, but yes, we are  
exploring what works and what doesn't.  We've got several active  
committers and some active contributors, which are all good signs and  
we actually have a pretty healthy base of mailing list subscribers  
lurking.  We also have users coming in and kicking the tires; we need
to capture their needs and keep them interested by responding quickly  
and in a helpful way.  We also need to find a way to pull the lurkers  
out to help by providing an ever more compelling story.

Open source is always incremental and it takes time to build.  It  
really is never done and I find O/S is often much more fluid than  
products.



>
> I'm not clear we quite have those things yet. Until we do this will be
> an 0.x project that nobody can really get into using for production.
> It doesn't have to happen tomorrow, but, what's our path like from
> here to there? Spare time from even 10 people won't get the docs
> written, tidy the code, refactor / redesign / unify the lot of
> copy/paste that's going on, etc. People definitely have ideas about
> what the project should do -- I see lots of little bits of
> functionality being thrown into the pot. But is it adding up to
> something consistent and coherent? should we talk seriously about it?
> "Machine learning" is too broad a remit.

I think we are getting there.  Some of the answer is above in the  
first part where I talk about releases.  I do think the bits are  
adding up to real machine learning functionality.  We've got utilities  
in place for getting data into formats that are consumable, we've got  
implementations that consume those formats and produce outputs.  More
examples, etc., will always help, and of course documentation.

It took Lucene 6+ years to reach what I would call a really capable  
system.  The early stages were promising and worked for many, but
it was not until 2004-05 that it really started taking off.  Not  
saying that Mahout will take that long, especially given how widely  
adopted/accepted Open Source is now as compared to the early days of  
Lucene, but it does take time.  That being said, we certainly need to  
get more people looking at more parts of the code and proposing and  
implementing improvements.

>
> It's not ruining my day or anything, but I'm sitting on a piece of the
> project that I put effort into making clearly do a few things, do them
> well, and not try to do other things -- designed for practical use
> cases, documented, polished, and tested. So I'll be a little
> concerned if it's attached to an early-0.x tinkering project this time
> next year. That's not cool for an Apache project anyway.

Agreed.  Let's get what is marked for 0.2 done and look to release
soon thereafter (mid-October?).  From there, I'd guess we could do
a 0.3 (or even 0.9) in the early Jan.-March time frame and then look
to make a 1.0 in early summer.  People contributing and pushing can
obviously move this up.  Our job as the committers is to make sure, to
some extent, that their efforts don't go to waste.

>
> It may be presumptuous but I volunteer to try to lead answers to these
> questions. It's going to lead to some tough answers and more work in
> some cases, no matter who drives it. Hoping to do it sooner than
> later.

Not at all presumptuous.  This is in fact how it works at Apache.
Right or wrong, those who do the work get to make the decisions.  That's
how the meritocracy works.  I personally am committed and I know several
others are (obviously, including you) as well.

-Grant

Re: What's the plan for Mahout?

Posted by Sean Owen <sr...@gmail.com>.
I agree with your high-level breakdown. I wouldn't say I'm in a
hurry yet, just wishing to ask the questions and start to form a plan.
I don't mind if it only comes together into something polished in a year --
I would be worried if a year passes and it's not clear where it's going.

I appreciate this is necessarily volunteer work from people's busy
schedules and we can't afford to invest a ton of mental energy as if
this is our full-time startup or something. That said, if it's worth
smart people putting time into, we might as well make sure it's on
track to be a world-class reference library.


I'm interested in continuing the conversation you've started here --


- Is it fair to say that Mahout is basically machine learning stuff,
focused on large scale? particularly, focused on Hadoop? that would be
pretty coherent. Then we probably need some consistent approach to
Hadoop integration, some very step-by-step documentation, and may want
to start reorganizing the code a little to align with this goal.

- So we are specifically not focused on stuff that's not distributed
Hadoop-based jobs?

- The audience is definitely developers it seems. Nobody's trying to
put a GUI on this.

- I don't mind "stuff" in the library really, though I do like a clear
sense of what it is, and isn't, that might guide us as to what to keep
and what to remove.

- Would we like to make a goal for 0.3 to present some unified,
designed approach to Hadoop integration?

- Maybe for, say, 0.4 we have a big 'cookbook' and lots of ready-made examples?


... these are the sorts of things, nothing big, I'd love to talk about
sooner than later. Then that guides future work and we can check our
progress against it. Then we have a clear identity to put out there.


On Fri, Sep 4, 2009 at 7:07 PM, Ted Dunning<te...@gmail.com> wrote:
> These are good questions to ask.  I don't know that we are ready to answer
> them, but I do think that we have pieces of the answers.
>
> So far, there are three or four general themes that seem to be of real
> interest/value
>
> a) taste/collaborative filtering/cooccurrence analysis
>
> b) facilitation of conventional machine learning by large scale aggregation
> using hadoop (so far, this is largely cooccurrence counting)
>
> c) standard and basic machine learning tasks like clustering, simple
> classifiers running on large scale data
>
> d) stuff
>
> There is definitely pull for something like (a) both in the form of a CF
> library roughly equivalent to lucene.  I know that I have a need for (b) and
> occasionally (c).
>
> It seems reasonable that we can provide a coherent story for (a), (b) and
> (c).  If that is true, then (d) can go along for the ride.
>
> The fact is, however, 99% of the machine learning that I do is quite doable
> in a conventional system like R, although some of that 99% needs (b).  Very
> occasionally I need algorithms to run at large scale, but those systems
> always involve quite a bit of engineering to connect the data fire-hoses
> into the right spigots.  I don't think that my experience is all that unusual,
> either.
>
> Do other people share Sean's sense of urgency?
>
> Is my break-down a reasonable one?
>
> On Fri, Sep 4, 2009 at 9:13 AM, Sean Owen <sr...@gmail.com> wrote:
>
>> It may be presumptuous but I volunteer to try to lead answers to these
>> questions. It's going to lead to some tough answers and more work in
>> some cases, no matter who drives it. Hoping to do it sooner than
>> later.
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: What's the plan for Mahout?

Posted by Isabel Drost <is...@apache.org>.
On Monday 07 September 2009 18:41:10 Ted Dunning wrote:
> I would say that abstracting away from hadoop is a huge issue that we
> definitely don't need to worry about right now.

+1 

Isabel

-- 
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_  
 |,4-  ) )-,_..;\ (  `'-' 
'---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>


Re: What's the plan for Mahout?

Posted by Ted Dunning <te...@gmail.com>.
I would say that abstracting away from hadoop is a huge issue that we
definitely don't need to worry about right now.

Even the hadoop guys haven't figured out the right interface yet!

On Mon, Sep 7, 2009 at 12:32 AM, Lukáš Vlček <lu...@gmail.com> wrote:

> just a note: Wouldn't it be better to talk about MapReduce as opposed to
> Hadoop? This means that for each algorithm implemented in Mahout it should
> be clearly stated whether it is a MapReduce-based implementation or not (or
> uses other ways to make it scalable).
>



-- 
Ted Dunning, CTO
DeepDyve

Re: What's the plan for Mahout?

Posted by Sean Owen <sr...@gmail.com>.
Well that is true. I think I'm stating the obvious when I say there is
also a danger in too little direction and focus. I imagine we will
find a happy medium somewhere in between.

On Mon, Sep 7, 2009 at 1:56 PM, Grant Ingersoll<gs...@apache.org> wrote:
>
> The hard thing about all of this is, in open source, you never know where
> the next good idea is coming from, especially in community-driven projects
> (as opposed to the "benevolent dictator" models where one or two people
> drive the whole thing.)  You can plan all you want, but when someone comes
> along with some really nice idea that doesn't fit into your plans, it's
> pretty hard to turn them away when it meets with the general goals of the
> project.  For instance, there is a PLSI implementation sitting in JIRA that
> just so happens to be implemented in Pig.  I can't say Pig was in my
> original plans, but I have no objection to it and plan on committing it once
> I review it.  Even LDA and the frequent pattern mining weren't in the
> "original" plans, yet I think they are welcome additions.
>
> That's not to say we shouldn't clean things up and do some planning, but as
> with everything, it's all going to be driven by how people contribute and
> who takes on the work.  The plus side to cleaning up, etc. is that it should
> make it easier for people to contribute.
>
> -Grant
>
>

Re: What's the plan for Mahout?

Posted by Grant Ingersoll <gs...@apache.org>.
On Sep 7, 2009, at 4:49 AM, Sean Owen wrote:

> I am sure the project needs to refactor and unify the Hadoop-related
> code. There's a lot of copy and paste at this stage. That would go
> some way towards abstracting away Hadoop -- would tend to centralize
> the dependency.
>
> I think there's a lot more to it -- abstracting away contacting a
> cluster? running a job? storing and reading data? Then you're also
> learning how to configure Mahout's layer, as well as your underlying
> infrastructure. My gut says it's hard, compared to the value it could
> add. Given that Hadoop is the de facto standard and big clouds like
> Amazon directly support it, it seems unlikely someone would not be
> able to use Hadoop. It's all just my guess given my impressions...
>
> My meta-concern is that we don't really have a polished, finished
> approach to using even Hadoop (which is again to be expected given
> it's early, and given Hadoop is evolving fast too) -- so would rather
> focus on tying up loose ends, or documenting and testing, before
> reaching too much farther.
>

The hard thing about all of this is, in open source, you never know  
where the next good idea is coming from, especially in community- 
driven projects (as opposed to the "benevolent dictator" models where  
one or two people drive the whole thing.)  You can plan all you want,  
but when someone comes along with some really nice idea that doesn't  
fit into your plans, it's pretty hard to turn them away when it meets  
with the general goals of the project.  For instance, there is a PLSI  
implementation sitting in JIRA that just so happens to be implemented  
in Pig.  I can't say Pig was in my original plans, but I have no  
objection to it and plan on committing it once I review it.  Even LDA  
and the frequent pattern mining weren't in the "original" plans, yet I  
think they are welcome additions.

That's not to say we shouldn't clean things up and do some planning,  
but as with everything, it's all going to be driven by how people  
contribute and who takes on the work.  The plus side to cleaning up,  
etc. is that it should make it easier for people to contribute.

-Grant


Re: What's the plan for Mahout?

Posted by Sean Owen <sr...@gmail.com>.
I am sure the project needs to refactor and unify the Hadoop-related
code. There's a lot of copy and paste at this stage. That would go
some way towards abstracting away Hadoop -- would tend to centralize
the dependency.

I think there's a lot more to it -- abstracting away contacting a
cluster? running a job? storing and reading data? Then you're also
learning how to configure Mahout's layer, as well as your underlying
infrastructure. My gut says it's hard, compared to the value it could
add. Given that Hadoop is the de facto standard and big clouds like
Amazon directly support it, it seems unlikely someone would not be
able to use Hadoop. It's all just my guess given my impressions...

My meta-concern is that we don't really have a polished, finished
approach to using even Hadoop (which is again to be expected given
it's early, and given Hadoop is evolving fast too) -- so would rather
focus on tying up loose ends, or documenting and testing, before
reaching too much farther.

On Mon, Sep 7, 2009 at 9:02 AM, Lukáš Vlček<lu...@gmail.com> wrote:
> Maybe there is no direct equivalent, but there are many ways one can
> build a MapReduce architecture into an existing system without Hadoop. And there
> is something all these systems have in common at a high level. I can see many
> existing systems adding the MapReduce paradigm to their stack (e.g.
> Aster, GigaSpaces, ... to name a few). Do you think it would be too difficult
> or impractical at this point to target a clean design of algorithms in Mahout
> and make them pure MapReduce, as opposed to coupled to Hadoop? A MapReduce
> API can be just a set of a few interfaces (and I think there are already such
> interfaces in Hadoop, but I don't think you can get them as a separate JAR).
> The rest of the Hadoop dependencies (like using HDFS) can be abstracted
> later if needed.
> Think of a developer who would like to use Mahout but cannot use Hadoop.
> For such a developer it would be "just" a matter of adapting Mahout to his/her
> proprietary MapReduce system. I am not saying Mahout should have this
> capability now, but it would be a nice goal.
>
> Regards,
> Lukas

Re: What's the plan for Mahout?

Posted by Lukáš Vlček <lu...@gmail.com>.
Maybe there is no direct equivalent, but there are many ways one can
build a MapReduce architecture into an existing system without Hadoop. And there
is something all these systems have in common at a high level. I can see many
existing systems adding the MapReduce paradigm to their stack (e.g.
Aster, GigaSpaces, ... to name a few). Do you think it would be too difficult
or impractical at this point to target a clean design of algorithms in Mahout
and make them pure MapReduce, as opposed to coupled to Hadoop? A MapReduce
API can be just a set of a few interfaces (and I think there are already such
interfaces in Hadoop, but I don't think you can get them as a separate JAR).
The rest of the Hadoop dependencies (like using HDFS) can be abstracted
later if needed.
Think of a developer who would like to use Mahout but cannot use Hadoop.
For such a developer it would be "just" a matter of adapting Mahout to his/her
proprietary MapReduce system. I am not saying Mahout should have this
capability now, but it would be a nice goal.
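
A minimal sketch of what I mean by "a set of a few interfaces" -- all names
here are invented for illustration, not an existing API:

    // Hypothetical, provider-neutral MapReduce contract.  Mahout algorithms
    // would code against these interfaces alone; a Hadoop adapter (or an
    // adapter for any other MapReduce engine) would implement Collector
    // and drive the Mapper and Reducer.
    public final class MapReduceApi {

      private MapReduceApi() {}

      /** Receives the key/value pairs a Mapper or Reducer emits. */
      public interface Collector<K, V> {
        void collect(K key, V value);
      }

      public interface Mapper<KI, VI, KO, VO> {
        void map(KI key, VI value, Collector<KO, VO> out);
      }

      public interface Reducer<KI, VI, KO, VO> {
        void reduce(KI key, Iterable<VI> values, Collector<KO, VO> out);
      }
    }

A Hadoop adapter would then be a small class wrapping these in Hadoop's own
Mapper/Reducer types, and things like HDFS access could hide behind a
similar seam later.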

Regards,
Lukas


On Mon, Sep 7, 2009 at 9:42 AM, Sean Owen <sr...@gmail.com> wrote:

> I don't know of any other viable alternative at the moment, and I
> think any alternative would be sufficiently different that it would be
> hard to meaningfully abstract it away without inventing our own little
> mapreduce layer. It still doesn't save anyone from thinking about the
> details of configuring the underlying implementation -- in fact, now
> they have to worry about configuring the Mahout-style mapreduce layer as
> well.
>
> (In comparison, take a look at something as simple as logging. Through
> people inventing abstractions, and abstractions on abstractions, it's
> actually turned into something difficult to manage. Using SLF4J,
> putting in the right bindings .jar so it routes through Log4J -- and
> don't forget log4j.xml -- which you have to use because your
> dependencies use it, and then, what about that library that will try
> to select Log4J or Commons on its own, but it's using Commons because
> it found it in the classpath, and now you don't remember which file
> configures that, and...)
>
>
> On Mon, Sep 7, 2009 at 8:32 AM, Lukáš Vlček<lu...@gmail.com> wrote:
> > Hi,
> > just a note: Wouldn't it be better to talk about MapReduce as opposed to
> > Hadoop? This means that for each algorithm implemented in Mahout it should
> > be clearly stated whether it is a MapReduce-based implementation or not (or
> > uses other ways to make it scalable). I can imagine it could be useful to
> > abstract from Hadoop to the point where it would be possible to use
> > different MapReduce providers. I am not sure whether there is any consensus
> > about how a MapReduce interface API should look, but Mahout could be a
> > good candidate for a project to define and create an abstract MapReduce API.
>

Re: What's the plan for Mahout?

Posted by Lukáš Vlček <lu...@gmail.com>.
>
>
> (In comparison, take a look at something as simple as logging. Through
> people inventing abstractions, and abstractions on abstractions, it's
> actually turned into something difficult to manage. Using SLF4J,
> putting in the right bindings .jar so it routes through Log4J -- and
> don't forget log4j.xml -- which you have to use because your
> dependencies use it, and then, what about that library that will try
> to select Log4J or Commons on its own, but it's using Commons because
> it found it in the classpath, and now you don't remember which file
> configures that, and...)
>

This is a nice point. I don't think that the original author of commons
logging wanted his system to become the "de facto" standard logging
abstraction: http://radio.weblogs.com/0122027/2003/08/15.html
But commons logging grew and spread into so many projects that it is
really hard to get rid of it now. A remarkable piece of history to learn from
:-)


>
>
> On Mon, Sep 7, 2009 at 8:32 AM, Lukáš Vlček<lu...@gmail.com> wrote:
> > Hi,
> > just a note: Wouldn't it be better to talk about MapReduce as opposed to
> > Hadoop? This means that for each algorithm implemented in Mahout it should
> > be clearly stated whether it is a MapReduce-based implementation or not (or
> > uses other ways to make it scalable). I can imagine it could be useful to
> > abstract from Hadoop to the point where it would be possible to use
> > different MapReduce providers. I am not sure whether there is any consensus
> > about how a MapReduce interface API should look, but Mahout could be a
> > good candidate for a project to define and create an abstract MapReduce API.
>

Re: What's the plan for Mahout?

Posted by Sean Owen <sr...@gmail.com>.
I don't know of any other viable alternative at the moment, and I
think any alternative would be sufficiently different that it would be
hard to meaningfully abstract it away without inventing our own little
mapreduce layer. It still doesn't save anyone from thinking about the
details of configuring the underlying implementation -- in fact, now
they have to worry about configuring the Mahout-style mapreduce layer as
well.

(In comparison, take a look at something as simple as logging. Through
people inventing abstractions, and abstractions on abstractions, it's
actually turned into something difficult to manage. Using SLF4J,
putting in the right bindings .jar so it routes through Log4J -- and
don't forget log4j.xml -- which you have to use because your
dependencies use it, and then, what about that library that will try
to select Log4J or Commons on its own, but it's using Commons because
it found it in the classpath, and now you don't remember which file
configures that, and...)
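
To be concrete about the indirection: code written against the facade never
names the real backend at all -- that gets decided by whichever binding jar
lands on the classpath. A made-up example (the class here is hypothetical):

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // The code sees only the SLF4J facade.  Whether this output ultimately
    // goes to Log4J, JDK logging, or nowhere at all depends on which
    // binding jar happens to be on the classpath at runtime -- which is
    // exactly the configuration puzzle described above.
    public class SomeMahoutJob {

      private static final Logger log =
          LoggerFactory.getLogger(SomeMahoutJob.class);

      public void run() {
        log.info("starting job");
      }
    }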


On Mon, Sep 7, 2009 at 8:32 AM, Lukáš Vlček<lu...@gmail.com> wrote:
> Hi,
> just a note: Wouldn't it be better to talk about MapReduce as opposed to
> Hadoop? This means that for each algorithm implemented in Mahout it should
> be clearly stated whether it is a MapReduce-based implementation or not (or
> uses other ways to make it scalable). I can imagine it could be useful to
> abstract from Hadoop to the point where it would be possible to use
> different MapReduce providers. I am not sure whether there is any consensus
> about how a MapReduce interface API should look, but Mahout could be a
> good candidate for a project to define and create an abstract MapReduce API.

Re: What's the plan for Mahout?

Posted by Lukáš Vlček <lu...@gmail.com>.
Hi,
just a note: Wouldn't it be better to talk about MapReduce as opposed to
Hadoop? This means that for each algorithm implemented in Mahout it should
be clearly stated whether it is a MapReduce-based implementation or not (or
uses other ways to make it scalable). I can imagine it could be useful to
abstract from Hadoop to the point where it would be possible to use
different MapReduce providers. I am not sure whether there is any consensus
about how a MapReduce interface API should look, but Mahout could be a
good candidate for a project to define and create an abstract MapReduce API.

Regards,
Lukas


On Sun, Sep 6, 2009 at 5:56 PM, Ted Dunning <te...@gmail.com> wrote:

> I see this as a critical issue.
>
> On Sun, Sep 6, 2009 at 8:31 AM, Isabel Drost <is...@apache.org> wrote:
>
> >
> > > but those systems always involve quite a bit of engineering to connect
> > the
> > > data fire-hoses into the right spigots.
> >
> > I wonder whether there is any way we can make that easier for users? We
> > certainly cannot support all use cases, but at least for text mining we
> > already have some glue code in place.
>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: What's the plan for Mahout?

Posted by Ted Dunning <te...@gmail.com>.
I see this as a critical issue.

On Sun, Sep 6, 2009 at 8:31 AM, Isabel Drost <is...@apache.org> wrote:

>
> > but those systems always involve quite a bit of engineering to connect
> the
> > data fire-hoses into the right spigots.
>
> I wonder whether there is any way we can make that easier for users? We
> certainly cannot support all use cases, but at least for text mining we
> already have some glue code in place.




-- 
Ted Dunning, CTO
DeepDyve

Re: What's the plan for Mahout?

Posted by Isabel Drost <is...@apache.org>.
On Saturday 05 September 2009 17:30:14 Grant Ingersoll wrote:
> we are a machine learning project with a commercial 
> friendly license and a solid community aiming to build fast, production
> ready libraries.

+1 I think that summarizes pretty well what I see in Mahout as well.


> Java, Hadoop and distributed are important, but secondary in my mind.

+1


> +1 to coherent API, but that is always evolving, too.

+1 - I see application developers who are maybe a tiny little bit familiar
with machine learning as our audience. For those playing around with the
algorithms and clicking workflows through a GUI, there are other projects
that are far more suitable, I'd say.


> but those systems always involve quite a bit of engineering to connect the
> data fire-hoses into the right spigots.

I wonder whether there is any way we can make that easier for users? We 
certainly cannot support all use cases, but at least for text mining we 
already have some glue code in place.

Isabel

-- 
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_  
 |,4-  ) )-,_..;\ (  `'-' 
'---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>


Re: What's the plan for Mahout?

Posted by Sean Owen <sr...@gmail.com>.
Practically speaking, to guide short-term goals, we do need to start
with a narrower, coherent remit and expand later. Starting as a
Java-based, Hadoop-based library for developers, focusing on
collaborative filtering, clustering, categorization, and a few other
things sounds just right.

It would be bad to think, well, we're about anything
machine-learning-related at all, and take a couple steps in 10
different directions, rather than start by thoroughly exploring a
couple. But nobody is saying that, it seems. Let's start by being a
great library as described above.

To that end I do want to push on...
1) Unifying our Hadoop integration -- well, once Hadoop sorts itself
out again. 0.20.0 doesn't really 'work', it seems.
2) Unifying the code base -- see message about the common and utils
package for instance

If we do stuff like this we really are going to arrive at a useful,
polished, coherent product soon.

Sean

On Sat, Sep 5, 2009 at 4:30 PM, Grant Ingersoll<gs...@apache.org> wrote:
> I don't think we necessarily need to be distributed or Hadoop based, but
> those are what we led with so far and it's a good start.  The nice thing is
> the stuff works just fine in standalone mode, too.   First and foremost, we
> are a machine learning project with a commercial friendly license and a
> solid community aiming to build fast, production ready libraries.  Java,
> Hadoop and distributed are important, but secondary in my mind.  There will
> certainly be some algorithms that we can't implement in Hadoop.  See

Re: What's the plan for Mahout?

Posted by Grant Ingersoll <gs...@apache.org>.
On Sep 5, 2009, at 9:41 AM, Sean Owen wrote:

> To kind of wrap this up for now --
>
> I hear some consensus that Mahout is about distributed, Hadoop-based
> solutions for developers. So let's make sure we present a clean,
> coherent API to developers wanting to run the project's Hadoop jobs.

I don't think we necessarily need to be distributed or Hadoop based,  
but those are what we led with so far and its a good start.  The nice  
thing is the stuff works just fine in standalone mode, too.   First  
and foremost, we are a machine learning project with a commercial  
friendly license and a solid community aiming to build fast,  
production ready libraries.  Java, Hadoop and distributed are  
important, but secondary in my mind.  There will certainly be some  
algorithms that we can't implement in Hadoop.  See http://www.lucidimagination.com/search/document/ab7915e98d707194/thought_offering_ec2_s3_based_services 
.

+1 to coherent API, but that is always evolving, too.


>
> I think we're a little bit stuck now as Hadoop 0.20.0 is a little bit
> busted. But as it moves forward, perhaps I can volunteer to suggest
> changes to unify the various jobs, mappers, reducers, etc. across the
> project.
>

Cool.


> Sean
>
> On Fri, Sep 4, 2009 at 11:21 PM, Grant  
> Ingersoll<gs...@apache.org> wrote:
>>
>> On Sep 4, 2009, at 1:07 PM, Ted Dunning wrote:
>>
>>> These are good questions to ask.  I don't know that we are ready  
>>> to answer
>>> them, but I do think that we have pieces of the answers.
>>>
>>> So far, there are three or four general themes that seem to be of  
>>> real
>>> interest/value
>>>
>>> a) taste/collaborative filtering/cooccurrence analysis
>>>
>>> b) facilitation of conventional machine learning by large scale
>>> aggregation
>>> using hadoop (so far, this is largely cooccurrence counting)
>>>
>>> c) standard and basic machine learning tasks like clustering, simple
>>> classifiers running on large scale data
>>>
>>> d) stuff
>>
>> I'd add a few non-technical things I find useful:
>>
>> e)  Non-viral License
>>
>> f) Community supporting it (i.e. not abandoned) and a place to get  
>> answers
>> about practical problems.
>>
>> I've been frustrated more than once by the lack of (e) and (f) on  
>> some other
>> projects.  Not that I'm saying we solve (f) yet completely (could  
>> use a bit
>> more diversity in people answering, but that is starting to take  
>> hold, too),
>> but I do firmly believe Apache is one of the best places to build a
>> community.
>>
>> -Grant
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: What's the plan for Mahout?

Posted by Tanton Gibbs <ta...@gmail.com>.
+1  The scalable part is extremely important.  Perhaps we could add
robust as well, to ensure that the project does not become academic in
nature.

On Sat, Sep 5, 2009 at 1:14 PM, Ted Dunning<te...@gmail.com> wrote:
> I would say that Mahout is about scalable machine learning solutions.  Those
> may use Hadoop.  Or not.  The emphasis is on scaling.
>
> On Sat, Sep 5, 2009 at 6:41 AM, Sean Owen <sr...@gmail.com> wrote:
>
>> I hear some consensus that Mahout is about distributed, Hadoop-based
>> solutions for developers. So let's make sure we present a clean,
>> coherent API to developers wanting to run the project's Hadoop jobs.
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: What's the plan for Mahout?

Posted by Ted Dunning <te...@gmail.com>.
I would say that Mahout is about scalable machine learning solutions.  Those
may use Hadoop.  Or not.  The emphasis is on scaling.

On Sat, Sep 5, 2009 at 6:41 AM, Sean Owen <sr...@gmail.com> wrote:

> I hear some consensus that Mahout is about distributed, Hadoop-based
> solutions for developers. So let's make sure we present a clean,
> coherent API to developers wanting to run the project's Hadoop jobs.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: What's the plan for Mahout?

Posted by Sean Owen <sr...@gmail.com>.
To kind of wrap this up for now --

I hear some consensus that Mahout is about distributed, Hadoop-based
solutions for developers. So let's make sure we present a clean,
coherent API to developers wanting to run the project's Hadoop jobs.

I think we're a little bit stuck now as Hadoop 0.20.0 is a little bit
busted. But as it moves forward, perhaps I can volunteer to suggest
changes to unify the various jobs, mappers, reducers, etc. across the
project.

Sean

On Fri, Sep 4, 2009 at 11:21 PM, Grant Ingersoll<gs...@apache.org> wrote:
>
> On Sep 4, 2009, at 1:07 PM, Ted Dunning wrote:
>
>> These are good questions to ask.  I don't know that we are ready to answer
>> them, but I do think that we have pieces of the answers.
>>
>> So far, there are three or four general themes that seem to be of real
>> interest/value
>>
>> a) taste/collaborative filtering/cooccurrence analysis
>>
>> b) facilitation of conventional machine learning by large scale
>> aggregation
>> using hadoop (so far, this is largely cooccurrence counting)
>>
>> c) standard and basic machine learning tasks like clustering, simple
>> classifiers running on large scale data
>>
>> d) stuff
>
> I'd add a few non-technical things I find useful:
>
> e)  Non-viral License
>
> f) Community supporting it (i.e. not abandoned) and a place to get answers
> about practical problems.
>
> I've been frustrated more than once by the lack of (e) and (f) on some other
> projects.  Not that I'm saying we solve (f) yet completely (could use a bit
> more diversity in people answering, but that is starting to take hold, too),
> but I do firmly believe Apache is one of the best places to build a
> community.
>
> -Grant
>

Re: What's the plan for Mahout?

Posted by Bertie Shen <be...@gmail.com>.
Hi

  I just subscribed to this mailing list and plan to use the Mahout
collaborative filtering part. I feel that Mahout may be better off focusing
on a few algorithms first and doing them very well in a scalable way. Simple
algorithms such as naive Bayes and {item|user}-based collaborative filtering
may be the initial focus. Complex algorithms such as LDA can be delayed. By
applying a large data set to simple algorithms, we can achieve very good
quality. This is also where scalability, one important characteristic of
Mahout, really matters.
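
For example, the kind of user-based flow I have in mind looks roughly like
this with the Taste-style API (just a sketch; exact package and class names
shifted between releases, and the data file here is made up):

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class UserBasedExample {
      public static void main(String[] args) throws Exception {
        // ratings.csv holds one "userID,itemID,preference" triple per line
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood =
            new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender =
            new GenericUserBasedRecommender(model, neighborhood, similarity);
        // top five recommendations for user 1
        List<RecommendedItem> recs = recommender.recommend(1, 5);
        for (RecommendedItem rec : recs) {
          System.out.println(rec);
        }
      }
    }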

 Best,

Albert.

On Fri, Sep 4, 2009 at 3:21 PM, Grant Ingersoll <gs...@apache.org> wrote:

>
> On Sep 4, 2009, at 1:07 PM, Ted Dunning wrote:
>
>  These are good questions to ask.  I don't know that we are ready to answer
>> them, but I do think that we have pieces of the answers.
>>
>> So far, there are three or four general themes that seem to be of real
>> interest/value
>>
>> a) taste/collaborative filtering/cooccurrence analysis
>>
>> b) facilitation of conventional machine learning by large scale
>> aggregation
>> using hadoop (so far, this is largely cooccurrence counting)
>>
>> c) standard and basic machine learning tasks like clustering, simple
>> classifiers running on large scale data
>>
>> d) stuff
>>
>
> I'd add a few non-technical things I find useful:
>
> e)  Non-viral License
>
> f) Community supporting it (i.e. not abandoned) and a place to get answers
> about practical problems.
>
> I've been frustrated more than once by the lack of (e) and (f) on some
> other projects.  Not that I'm saying we solve (f) yet completely (could use
> a bit more diversity in people answering, but that is starting to take hold,
> too), but I do firmly believe Apache is one of the best places to build a
> community.
>
> -Grant
>

Re: What's the plan for Mahout?

Posted by Grant Ingersoll <gs...@apache.org>.
On Sep 4, 2009, at 1:07 PM, Ted Dunning wrote:

> These are good questions to ask.  I don't know that we are ready to  
> answer
> them, but I do think that we have pieces of the answers.
>
> So far, there are three or four general themes that seem to be of real
> interest/value
>
> a) taste/collaborative filtering/cooccurrence analysis
>
> b) facilitation of conventional machine learning by large scale  
> aggregation
> using hadoop (so far, this is largely cooccurrence counting)
>
> c) standard and basic machine learning tasks like clustering, simple
> classifiers running on large scale data
>
> d) stuff

I'd add a few non-technical things I find useful:

e)  Non-viral License

f) Community supporting it (i.e. not abandoned) and a place to get  
answers about practical problems.

I've been frustrated more than once by the lack of (e) and (f) on some  
other projects.  Not that I'm saying we solve (f) yet completely  
(could use a bit more diversity in people answering, but that is  
starting to take hold, too), but I do firmly believe Apache is one of  
the best places to build a community.

-Grant

Re: What's the plan for Mahout?

Posted by Ted Dunning <te...@gmail.com>.
These are good questions to ask.  I don't know that we are ready to answer
them, but I do think that we have pieces of the answers.

So far, there are three or four general themes that seem to be of real
interest/value

a) taste/collaborative filtering/cooccurrence analysis

b) facilitation of conventional machine learning by large scale aggregation
using hadoop (so far, this is largely cooccurrence counting)

c) standard and basic machine learning tasks like clustering, simple
classifiers running on large scale data

d) stuff

There is definitely pull for something like (a) both in the form of a CF
library roughly equivalent to lucene.  I know that I have a need for (b) and
occasionally (c).

It seems reasonable that we can provide a coherent story for (a), (b) and
(c).  If that is true, then (d) can go along for the ride.

The fact is, however, 99% of the machine learning that I do is quite doable
in a conventional system like R, although some of that 99% needs (b).  Very
occasionally I need algorithms to run at large scale, but those systems
always involve quite a bit of engineering to connect the data fire-hoses
into the right spigots.  I don't think that my experience is all that unusual,
either.
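
To make (b) concrete: the aggregation I mean is nothing exotic. Here is an
illustrative cooccurrence-counting mapper, written against the newer
org.apache.hadoop.mapreduce API (a sketch only, not code that is actually
in Mahout):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // One input line per user, listing the items that user touched.  For
    // every ordered pair of distinct items we emit ("itemA\titemB", 1);
    // a trivial summing reducer then yields the cooccurrence counts.
    public class CooccurrenceMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

      private static final IntWritable ONE = new IntWritable(1);
      private final Text pair = new Text();

      @Override
      protected void map(LongWritable offset, Text line, Context ctx)
          throws IOException, InterruptedException {
        String[] items = line.toString().split("\\s+");
        for (int i = 0; i < items.length; i++) {
          for (int j = 0; j < items.length; j++) {
            if (i != j) {
              pair.set(items[i] + '\t' + items[j]);
              ctx.write(pair, ONE);
            }
          }
        }
      }
    }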

Do other people share Sean's sense of urgency?

Is my break-down a reasonable one?

On Fri, Sep 4, 2009 at 9:13 AM, Sean Owen <sr...@gmail.com> wrote:

> It may be presumptuous but I volunteer to try to lead answers to these
> questions. It's going to lead to some tough answers and more work in
> some cases, no matter who drives it. Hoping to do it sooner than
> later.
>



-- 
Ted Dunning, CTO
DeepDyve