Posted to dev@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2015/02/25 18:50:52 UTC

What is Mahout?

Looking back over the last year Mahout has gone through a lot of changes. Most users are still using the legacy mapreduce code and new users have mostly looked elsewhere.

The fact that people as knowledgeable as former committers compare Mahout to Oryx or MLlib seems odd to me because Mahout is neither a server nor a loose collection of algorithms. It was the latter until all of mapreduce was moved to legacy and “no new mapreduce” was the rule.

But what is it now? What is unique and of value? Is it destined to be late to the party and chasing the algo checklists of things like MLlib?

First a slight digression. I looked at moving itemsimilarity to raw Spark, if only to remove mrlegacy from the dependencies. At about the same time another Mahouter asked the Spark list how to transpose a matrix. He got the answer “why would you want to do that?” The fairly high-performance algorithm behind spark-itemsimilarity was designed by Sebastian and requires an optimized A’A, A’B, A’C…, and spark-rowsimilarity requires AA’. None of these are provided by MLlib. No actual transpose is required, so these two things should be seen as separate comments about MLlib. The moral: unless I want to write optimized matrix transpose-and-multiply solvers, I will stick with Mahout.
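The transpose-free point is worth spelling out: A’A can be accumulated row by row as a sum of scaled rows, so the transposed matrix never has to be materialized or shuffled. A minimal in-memory sketch of just the arithmetic idea (plain Python, illustrative only; the real Mahout operator distributes this per-row accumulation across the cluster):

```python
# Compute A'A without ever forming A'.
# Each row a of A contributes outer(a, a) to the result, so rows can be
# processed independently -- which is why no transpose step is needed
# in a distributed row-partitioned matrix.

def ata(rows):
    n = len(rows[0])
    result = [[0.0] * n for _ in range(n)]
    for a in rows:                    # one pass over the rows of A
        for i in range(n):
            if a[i] != 0.0:           # skip zero entries, as sparse solvers do
                for j in range(n):
                    result[i][j] += a[i] * a[j]
    return result

A = [[1.0, 0.0, 2.0],
     [0.0, 3.0, 1.0]]

print(ata(A))  # [[1.0, 0.0, 2.0], [0.0, 9.0, 3.0], [2.0, 3.0, 5.0]]
```

In the Mahout DSL the same computation is simply written as A.t %*% A and, as I understand it, the optimizer recognizes the pattern and picks the transpose-free physical operator; the sketch above shows only the arithmetic trick, not the actual API.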

So back to Mahout’s unique value. Mahout today is a general linear algebra lib and environment that performs optimized calculations on modern engines like Spark. It is something like a Scala-fied R on Spark (or other engine).

If this is true then spark-itemsimilarity can be seen as a package/add-on that requires Mahout’s core Linear Algebra.

Why use Mahout? Use it if you need scalable general linear algebra. That’s not what MLlib does well. 

Should we be chasing MLlib’s algo list? Why would we? If we need some algo, why not consume it directly from MLlib or somewhere else? Why is a reimplementation important all else being equal?

Is general scalable linear algebra sufficient for all important ML algos? Certainly not. For instance streaming algos, and in particular online-updated streaming algos, may have little to gain from Mahout as it is today.

If the above is true then Mahout is nothing like what it was in 0.9 and is being unfairly compared to 0.9 and other things like that. This misunderstanding of what Mahout _is_ leads to misapplied criticism and lack of use for what it does well. At the very least this all implies a very different description on the CMS; at most, maybe something as drastic as a name change.



Re: What is Mahout?

Posted by Andrew Musselman <an...@gmail.com>.
I'd like us to cut a 1.0.0 or 0.10.0 release with the spark work, then
commit to regular maintenance/point releases and a semi-yearly major
release cycle, and agree that publicizing it with talks and articles is
essential.

I don't think changing the name would do anything to reinvigorate or
clarify interest and perception.  Even though Mahout's "elephant driving"
legacy is deprecated, it has brand recognition behind it.

There are some good things in the code base, especially the linear algebra
work and the DSL, which as you guys mention are just not in other tools
right now.  I like the idea of clearly defining a contrib package like what
Pig has, to incorporate purpose-built jobs.


Re: What is Mahout?

Posted by Pat Ferrel <pa...@occamsmachete.com>.
I’m not criticizing duplicate implementations at all. Just saying they shouldn’t be the group’s primary goal.

If we deliver an environment, let’s act and talk like environment devs.


Re: What is Mahout?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I think a release with some value in it and a talk clarifying status will
suffice for starters.

Name change IMO is immaterial if there's the value and talks clarify
general philosophy sufficiently. Nobody else can tell people better what it
is all about; it is the lack of a release, and of the information that
follows one, that turns people to speculation or a legacy understanding of things.

General philosophy -- yes, that's of R-base + R packages. Or, what I
actually like more, is that of Julia (which can also run on different
distributed shared-nothing programming models). People use off-the-shelf
stuff but people also do their own programming. I found that I have to
customize methodologies in some way in at least 80% of cases, which is why
the value for me is shifting towards 'r-base' rather than a set of packages. As R
demonstrates, do the former right-ish, and the latter will follow.

I don't care for comparisons and don't spend time thinking about collating
algorithm names. I'm strictly 100% pragmatically driven. If there's a black
box thing and it fits, I just take it. If not (and 80% of the time it is
"not"), I'd have to do something of my own. Take SPCA, for example.
There's no strict publication (known to me) that describes its exact flow.
It is just a 2-step derivation of Stochastic SVD (which is, in itself, a
2-step derivation/customization of the random projection paper). These
customizations and small derivations are actually incredibly numerous in
practice.

On mllib, there's probably little value in chasing the mllib set of things -- at
least not by "mahout-base" implementors, and not for the Spark backend. Since
in Spark's case we act as an "add-on", all black box mllib things are
already in our scope. They are, literally, available to the programming
environment of Mahout. But yes, probably some gentleman's survival kit
should eventually be present, even if it repeats some of the mllib methods, as
it is not automatically in scope for Flink (although, again, Flink has
stuff like K-means too). Kinda hoped the Flink guys could help with this one
day.



Re: What is Mahout?

Posted by Ted Dunning <te...@gmail.com>.
+1 for keeping the name

-1 for incubation





Re: What is Mahout?

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Along with workspaces, code completion, +1 for visualization and extended (Bayesian, stats, etc.) ops. Anything that is scalable and general seems fair game.

Also -1 for incubation.  This is all an evolution of loosely collected algos into generalizations and extensions of legacy stuff on new ground.

Also +1 for separating out packages more formally, like spark-itemsimilarity and other things that aren’t general. They may come with generalized bits (like similarity) but have package-like delivery mechanisms. We should be able to have something better than contrib, especially since these may come with generally useful math and core extensions. No need to separate that until the core is done.

However a new identity would be a big boost to communicating the new mission, and it is a new mission. If the issue is support for legacy, that doesn’t seem to be a problem. If we stay a top-level project we can support legacy; in fact we have to.


On Feb 25, 2015, at 6:21 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

-1 on incubation as well. The website and docs and user lists and this
champion and mentor stuff, and logos and promotions for committers,
absolutely do not make any sense at this point. From what I hear, people
are pretty busy without that as it is. It would probably make more
sense to take both Andrews :) and the committers who actively pursue the
programming environment vision into the PMC, and for people who feel that they
have no valuable input for the new philosophy of the project to just go emeritus
and give up their voting rights. "Power of do", as they say.

There's no major change in philosophy either -- Mahout has been proclaiming
"scalable machine learning", which is what we will continue doing. Only
doing it (hopefully) a bit easier and with a new set of backend tools.

I want to emphasize that I'd seek math environment status in a more general
sense: not just algebraic, but also connect this to stats, samplers,
optimizers (including Bayesian opts), feature extractors, i.e. all the basic
big ML tools. Adapt Spark's DataFrame to these tools where appropriate.
Viewing it as solely distributed algebra is a bit skewed away from reality.
On private branches, I have previously developed a lot of that
functionality (except for the visual stuff) and it is in practice very
useful; it creates a common umbrella for people with an R background.

I would very much want to integrate something for visualization, as it is
important for an environment. Unfortunately, I don't see any mature science
plotting for the JVM around. Scatter plots at best. I want at least to be
able to plot 2d maps and KDEs with contours or density levels. There are
ways to visualize massive datasets (and their parts). I see no tools for this
around at all. Maybe some clever way to integrate with ggplot2 or Shiny
Server? Even that would've been better, even if it required 3rd-party
software installation, than nothing at all.

I don't expect methodologies to go to contrib, actually. Slightly different
modules, maybe, but not so extreme as contrib.





On Wed, Feb 25, 2015 at 5:18 PM, Andrew Musselman <
andrew.musselman@gmail.com> wrote:

> How much would be involved in changing the name of a top-level project?
> 
> I'd prefer to avoid the overhead of going back into incubation.
> 
> I agree 0.10 makes more sense.
> 
> On Wed, Feb 25, 2015 at 12:16 PM, Sean Owen <sr...@gmail.com> wrote:
> 
>> My $0.02:
>> 
>> There is no shortage of algorithm libraries that are in some way
>> runnable on Hadoop out there, and not as many easy-to-use distributed
>> matrix operation libraries. I think it's more additive to the
>> ecosystem to solve that narrow, and deep, linear algebra problem and
>> really nail it. That's a pretty good 'identity' to claim. It seems
>> like an appropriate scope.
>> 
>> I do think the project has changed so much that it's more confusing to
>> keep calling it Mahout than to change the name. I can't think of one
>> person I've talked to about Mahout in the last 6 months that was not
>> under the impression that what is in 0.9 has simply been ported to
>> Spark. It's different enough that it could even be its own incubator
>> project (under a different name).
>> 
>> The brand recognition is for the deprecated part, so keeping that is
>> almost the problem. It's not crazy to just change the name. Or even
>> consider a re-incubation. It might give some latitude to more fully
>> reboot.
>> 
>> Releasing 1.0.0 on the other hand means committing to the APIs (and
>> name) for some fairly new code and fairly soon. Given that this is
>> sort of a 0.1 of a new project, going to 1.0 feels semantically wrong.
>> But a release would be good. Personally I'd suggest 0.10.
>> 


Re: What is Mahout?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
-1 on incubation as well. The website and docs and user lists and this
champion and mentor stuff, and logos and promotions for committers,
absolutely do not make any sense at this point. From what I hear, people
are pretty busy without that as it is. It would probably make more
sense to take both Andrews :) and the committers who actively pursue the
programming-environment vision into the PMC, and for people who feel they
have no valuable input for the new philosophy of the project to just go
emeritus and give up their voting rights. "Power of do", as they say.

There's no major change in philosophy either -- Mahout has been proclaiming
"scalable machine learning", which is what we will continue doing. Only
doing it (hopefully) a bit easier and with a new set of backend tools.

I want to emphasize that I'd seek math-environment status in a more general
sense: not just algebraic, but also connecting this to stats, samplers,
optimizers (including Bayesian optimizers), feature extractors, i.e. all the
basic big-ML tools. Adapt Spark's DataFrame to these tools where appropriate.
Viewing it as solely distributed algebra is a bit skewed away from reality.
On private branches, I have previously developed a lot of that
functionality (except for the visual stuff) and it is in practice very
useful; it creates a common umbrella for people with an R background.

I would very much want to integrate something for visualization, as it is
important for an environment. Unfortunately, I don't see any mature
scientific plotting for the JVM around; scatter plots at best. I want at
least to be able to plot 2D maps and KDEs with contours or density levels.
There are ways to visualize massive datasets (and their parts), but I see
no tools for this around at all. Maybe some clever way to integrate with
ggplot2 or Shiny Server? Even that, even if it required third-party
software installation, would have been better than nothing at all.

I don't expect methodologies to go to contrib, actually. Slightly different
modules, maybe, but nothing as extreme as contrib.





On Wed, Feb 25, 2015 at 5:18 PM, Andrew Musselman <
andrew.musselman@gmail.com> wrote:


Re: What is Mahout?

Posted by Andrew Musselman <an...@gmail.com>.
How much would be involved in changing the name of a top-level project?

I'd prefer to avoid the overhead of going back into incubation.

I agree 0.10 makes more sense.

On Wed, Feb 25, 2015 at 12:16 PM, Sean Owen <sr...@gmail.com> wrote:


Re: What is Mahout?

Posted by Sean Owen <sr...@gmail.com>.
My $0.02:

There is no shortage of algorithm libraries that are in some way
runnable on Hadoop out there, and not as many easy-to-use distributed
matrix operation libraries. I think it's more additive to the
ecosystem to solve that narrow, and deep, linear algebra problem and
really nail it. That's a pretty good 'identity' to claim. It seems
like an appropriate scope.

I do think the project has changed so much that it's more confusing to
keep calling it Mahout than to change the name. I can't think of one
person I've talked to about Mahout in the last 6 months who was not
under the impression that what is in 0.9 has simply been ported to
Spark. It's different enough that it could even be its own incubator
project (under a different name).

The brand recognition is for the deprecated part so keeping that is
almost the problem. It's not crazy to just change the name. Or even
consider a re-incubation. It might give some latitude to more fully
reboot.

Releasing 1.0.0 on the other hand means committing to the APIs (and
name) for some fairly new code and fairly soon. Given that this is
sort of a 0.1 of a new project, going to 1.0 feels semantically wrong.
But a release would be good. Personally I'd suggest 0.10.

On Wed, Feb 25, 2015 at 5:50 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> Looking back over the last year Mahout has gone through a lot of changes. Most users are still using the legacy mapreduce code and new users have mostly looked elsewhere.
>
> The fact that people as knowledgeable as former committers compare Mahout to Oryx or MLlib seems odd to me because Mahout is neither a server nor a loose collection of algorithms. It was the latter until all of mapreduce was moved to legacy and “no new mapreduce” was the rule.
>
> But what is it now? What is unique and of value? Is it destined to be late to the party and chasing the algo checklists of things like MLlib?
>
> First a slight digression. I looked at moving itemsimilarity to raw Spark if only to remove mrlegacy from the dependencies. At about the same time another Mahouter asked the Spark list how to transpose a matrix. He got the answer “why would you want to do that?” The fairly high-performance algorithm behind spark-itemsimilarity was designed by Sebastian and requires an optimized A’A, A’B, A’C… and spark-rowsimilarity requires AA’. None of these is provided by MLlib. No actual transpose is required, so these should be seen as two separate comments about MLlib. The moral: unless I want to write optimized matrix transpose-and-multiply solvers I will stick with Mahout.
>
> So back to Mahout’s unique value. Mahout today is a general linear algebra lib and environment that performs optimized calculations on modern engines like Spark. It is something like a Scala-fied R on Spark (or other engine).
>
> If this is true then spark-itemsimilarity can be seen as a package/add-on that requires Mahout’s core Linear Algebra.
>
> Why use Mahout? Use it if you need scalable general linear algebra. That’s not what MLlib does well.
>
> Should we be chasing MLlib’s algo list? Why would we? If we need some algo, why not consume it directly from MLlib or somewhere else? Why is a reimplementation important all else being equal?
>
> Is general scalable linear algebra sufficient for all important ML algos? Certainly not. For instance, streaming algorithms, and in particular online-updated streaming algos, may have little to gain from Mahout as it is today.
>
> If the above is true then Mahout is nothing like what it was in 0.9 and is being unfairly compared to 0.9 and similar things. This misunderstanding of what Mahout _is_ leads to misapplied criticism and a lack of use for what it does well. At the very least this all implies a very different description on the CMS; at most, maybe something as drastic as a name change.
>
>
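The transpose-free products Pat mentions can be sketched concisely: A'A is just the sum of the outer products of A's rows, so no distributed transpose ever needs to be materialized. Below is a hypothetical pure-Python illustration of that row-wise trick, not Mahout's actual Spark implementation:

```python
# Hypothetical sketch (not Mahout code) of the transpose-free trick:
# A'A equals the sum over rows a_i of the outer product a_i * a_i^T,
# so the transpose of A is never formed. In a distributed setting each
# worker would accumulate outer products of its own rows and the small
# n x n partial results would be summed at the end.

def ata(rows):
    """Compute A'A from an iterable of rows without forming A'."""
    n = len(rows[0])
    result = [[0.0] * n for _ in range(n)]
    for row in rows:
        for i, x in enumerate(row):
            if x == 0.0:
                continue  # skip zero entries, cheap for sparse rows
            for j, y in enumerate(row):
                result[i][j] += x * y
    return result

A = [[1.0, 0.0, 2.0],
     [0.0, 3.0, 1.0]]
print(ata(A))  # [[1.0, 0.0, 2.0], [0.0, 9.0, 3.0], [2.0, 3.0, 5.0]]
```

The same idea extends to A'B (outer products of co-partitioned rows of A and B), which is why an engine that only offers generic matrix ops, with no optimizer behind them, makes these pipelines awkward.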