Posted to dev@mahout.apache.org by Ted Dunning <te...@gmail.com> on 2015/03/06 21:30:26 UTC

Re: What is Mahout?

+1 for keeping the name

-1 for incubation




On Thu, Feb 26, 2015 at 5:24 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Along with workspaces, code completion, +1 for visualization and extended
> (bayesian, stats, etc) ops. Anything that is scalable and general seems
> fair game.
>
> Also -1 for incubation.  This is all an evolution of loosely collected
> algos into generalizations and extensions of legacy stuff on new ground.
>
> Also +1 for separating out packages more formally—like
> spark-itemsimilarity and other things that aren’t general. They may come
> with generalized bits (like similarity) but have package-like delivery
> mechanisms. We should be able to have something better than contrib,
> especially since these may come with math and core extensions generally
> useful. No need to separate that until the core is done.
>
> However a new identity would be a big boost to being able to communicate
> the new mission—and it is a new mission.  If the issue is about support
> for legacy that doesn’t seem to be a problem. If we stay a top level
> project we can support legacy, in fact we have to.
>
>
> On Feb 25, 2015, at 6:21 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> -1 on incubation as well. The website and docs and user lists and this
> champion and mentor stuff, and logos and promotions for committers
> absolutely do not make any sense at this point. From what i hear, people
> are pretty busy without having that as it is. It would probably make more
> sense to take both Andrews :) and committers who actively pursue the
> programming environment vision to the PMC, and for people who feel that
> they have no valuable input for the new philosophy of the project to just
> go emeritus and give up their voting rights. "Power of do", as they say.
>
> There's no major change in philosophy either -- mahout has been proclaiming
> "scalable machine learning", which is what we will continue doing. Only
> doing it (hopefully) a bit easier and with a new set of backend tools.
>
> I want to emphasize that i'd seek math environment status in a more general
> sense: not just algebraic, but also connect this to stats, samplers,
> optimizers, (including bayesian opts), feature extractors, i.e. all basic
> big ml tools. Adapt Spark's DataFrame to these tools where appropriate.
> Viewing it as solely distributed algebra is a bit skewed away from reality.
> On private branches, i have previously developed a lot of that
> functionality (except for the visual stuff) and it is in practice very
> useful; it creates a common umbrella for people with R background.
>
> I would very much want to integrate something for visualization, as it is
> important for the environment. Unfortunately, I don't see any mature
> science plotting stuff for the jvm around. Scatter plots at best. I want at
> least to be able to plot 2d maps and KDEs with contours or density levels.
> There are ways to visualize massive datasets (and their parts). I see no
> tools for this around at all. Maybe some clever way to integrate with
> ggplot2 or shiny server? Even that would've been better, even if it
> required 3rd party software installation, than nothing at all.
>
> I don't expect methodologies to go to contrib, actually. Slightly different
> modules, maybe, but not so extreme as contrib.
>
>
>
>
>
> On Wed, Feb 25, 2015 at 5:18 PM, Andrew Musselman <
> andrew.musselman@gmail.com> wrote:
>
> > How much would be involved in changing the name of a top-level project?
> >
> > I'd prefer to avoid the overhead of going back into incubation.
> >
> > I agree 0.10 makes more sense.
> >
> > On Wed, Feb 25, 2015 at 12:16 PM, Sean Owen <sr...@gmail.com> wrote:
> >
> >> My $0.02:
> >>
> >> There is no shortage of algorithm libraries that are in some way
> >> runnable on Hadoop out there, and not as many easy-to-use distributed
> >> matrix operation libraries. I think it's more additive to the
> >> ecosystem to solve that narrow, and deep, linear algebra problem and
> >> really nail it. That's a pretty good 'identity' to claim. It seems
> >> like an appropriate scope.
> >>
> >> I do think the project has changed so much that it's more confusing to
> >> keep calling it Mahout than to change the name. I can't think of one
> >> person I've talked to about Mahout in the last 6 months that was not
> >> under the impression that what is in 0.9 has simply been ported to
> >> Spark. It's different enough that it could even be its own incubator
> >> project (under a different name).
> >>
> >> The brand recognition is for the deprecated part so keeping that is
> >> almost the problem. It's not crazy to just change the name. Or even
> >> consider a re-incubation. It might give some latitude to more fully
> >> reboot.
> >>
> >> Releasing 1.0.0 on the other hand means committing to the APIs (and
> >> name) for some fairly new code and fairly soon. Given that this is
> >> sort of a 0.1 of a new project, going to 1.0 feels semantically wrong.
> >> But a release would be good. Personally I'd suggest 0.10.
> >>
> >> On Wed, Feb 25, 2015 at 5:50 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> >>> Looking back over the last year Mahout has gone through a lot of
> >>> changes. Most users are still using the legacy mapreduce code and new
> >>> users have mostly looked elsewhere.
> >>>
> >>> The fact that people as knowledgeable as former committers compare
> >>> Mahout to Oryx or MLlib seems odd to me because Mahout is neither a
> >>> server nor a loose collection of algorithms. It was the latter until all
> >>> of mapreduce was moved to legacy and “no new mapreduce” was the rule.
> >>>
> >>> But what is it now? What is unique and of value? Is it destined to be
> >>> late to the party and chasing the algo checklists of things like MLlib?
> >>>
> >>> First a slight digression. I looked at moving itemsimilarity to raw
> >>> Spark if only to remove mrlegacy from the dependencies. At about the
> >>> same time another Mahouter asked the Spark list how to transpose a
> >>> matrix. He got the answer “why would you want to do that?” The fairly
> >>> high performance algorithm behind spark-itemsimilarity was designed by
> >>> Sebastian and requires an optimized A’A, A’B, A’C… and
> >>> spark-rowsimilarity requires AA’. None of these are provided by MLlib.
> >>> No actual transpose is required so these two things should be seen as
> >>> separate comments about MLlib. The moral: unless I want to write
> >>> optimized matrix transpose-and-multiply solvers I will stick with
> >>> Mahout.
> >>>
> >>> So back to Mahout’s unique value. Mahout today is a general linear
> >>> algebra lib and environment that performs optimized calculations on
> >>> modern engines like Spark. It is something like a Scala-fied R on Spark
> >>> (or other engine).
> >>>
> >>> If this is true then spark-itemsimilarity can be seen as a
> >>> package/add-on that requires Mahout’s core Linear Algebra.
> >>>
> >>> Why use Mahout? Use it if you need scalable general linear algebra.
> >>> That’s not what MLlib does well.
> >>>
> >>> Should we be chasing MLlib’s algo list? Why would we? If we need some
> >>> algo, why not consume it directly from MLlib or somewhere else? Why is a
> >>> reimplementation important all else being equal?
> >>>
> >>> Is general scalable linear algebra sufficient for all important ML
> >>> algos? Certainly not. For instance streaming ones and in particular
> >>> online updated streaming algos may have little to gain from Mahout as
> >>> it is today.
> >>>
> >>> If the above is true then Mahout is nothing like what it was in 0.9 and
> >>> is being unfairly compared to 0.9 and other things like that. This
> >>> misunderstanding of what Mahout _is_ leads to misapplied criticism and
> >>> lack of use for what it does well. At the very least this all implies a
> >>> very different description on the CMS; at most, maybe something as
> >>> drastic as a name change.
> >>>
> >>>
> >>
> >
>
>