Posted to user@mahout.apache.org by Mahesh Balija <ba...@gmail.com> on 2014/10/22 00:04:16 UTC

Mahout Vs Spark

Hi Team,

As the Spark framework is gaining more attention among open-source Big Data
frameworks, with support for a variety of applications such as:

1) Shark
2) GraphX
3) MLlib
4) Streaming

and with the rapid development of algorithms for clustering,
classification, regression, etc. in the MLlib package, plus built-in
support for Scala, I am trying to differentiate between Mahout and Spark.
Here is a small comparison:

  Feature                   Mahout   Spark
  Clustering                Y        Y
  Classification            Y        Y
  Regression                Y        Y
  Dimensionality Reduction  Y        Y
  Java                      Y        Y
  Scala                     N        Y
  Python                    N        Y
  NumPy                     N        Y
  Hadoop                    Y        Y
  Text Mining               Y        N
  Scala/Spark Bindings      Y        N/A
  Scalability               Y        Y
Apart from the above, Mahout has broad coverage of machine learning
algorithms, with many utilities and APIs, compared to Spark. And Mahout 1.0
provides support for Scala and Spark bindings.

I am trying to demarcate between Mahout and Spark.
Can you throw some light on the key differences and the uniqueness of the
Mahout framework? Am I missing any important distinction that makes Mahout
the only choice for scalable machine learning?

Best,
Mahesh.B

Re: Mahout Vs Spark

Posted by Ted Dunning <te...@gmail.com>.
What you say does not imply that NumPy can interoperate with existing
Spark machine learning code.  It is also certainly the case that no NumPy
code currently uses Spark.

It may well be that users could use numpy in closures being sent to Spark,
but that is a far walk from useful parallel numerical code.
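The "closures sent to Spark" model can be sketched without Spark at all. This is a plain-Python illustration; none of the names below are Spark's actual API. The driver splits the data into partitions and applies the same function to each (as an executor would), then combines the partial results. In real PySpark the function is serialized and shipped to worker processes, and NumPy could indeed be called inside it, but each partition arrives as a plain Python list, which is part of why this is still a far walk from useful parallel numerical code.

```python
# Plain-Python sketch of the "ship a closure to each partition" model.
# No Spark here; `partition` and `mean_and_count` are invented names
# for illustration only.

def partition(data, n):
    """Split data into n roughly equal partitions."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def mean_and_count(part):
    # Inside a real Spark closure this is where NumPy could be used on
    # the partition's rows -- but each partition is a plain Python list,
    # so any array library first has to copy it into its own format.
    return (sum(part), len(part))

parts = partition(list(range(1, 101)), 4)
partials = [mean_and_count(p) for p in parts]   # "executors"
total, count = map(sum, zip(*partials))          # "driver" combine
print(total / count)   # 50.5
```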



On Thu, Oct 23, 2014 at 4:48 PM, thejas prasad <th...@gmail.com> wrote:

> Ted, I am not too sure, but this FAQ, https://spark.apache.org/faq.html,
> suggests otherwise, I think:
>
> Q: Does Spark require modified versions of Scala or Python?
>
> No. Spark requires no changes to Scala or compiler plugins. The Python API
> uses the standard CPython implementation, and can call into existing C
> libraries for Python such as NumPy.
>
>
>
> On Thu, Oct 23, 2014 at 1:11 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > Hmmm....
> >
> > I don't think that the array formats used by Spark are compatible with
> the
> > formats used by numpy.
> >
> > I could be wrong, but even if there isn't outright incompatibility, there
> > is likely to be some significant overhead in format conversion.
> >
> >
> > On Tue, Oct 21, 2014 at 6:12 PM, Vibhanshu Prasad <
> > vibhanshugsoc2@gmail.com>
> > wrote:
> >
> > > Actually, Spark is available in Python also, so users of Spark have
> > > an upper hand over users of traditional Mahout. This is applicable to
> > > all the libraries of Python (including NumPy).
> > >
> > > On Wed, Oct 22, 2014 at 3:54 AM, Ted Dunning <te...@gmail.com>
> > > wrote:
> > >
> > > > On Tue, Oct 21, 2014 at 3:04 PM, Mahesh Balija <
> > > balijamahesh.mca@gmail.com
> > > > >
> > > > wrote:
> > > >
> > > > > I am trying to differentiate between Mahout and Spark, here is
> > > > > the small list,
> > > > >
> > > > >   Feature                   Mahout   Spark
> > > > >   Clustering                Y        Y
> > > > >   Classification            Y        Y
> > > > >   Regression                Y        Y
> > > > >   Dimensionality Reduction  Y        Y
> > > > >   Java                      Y        Y
> > > > >   Scala                     N        Y
> > > > >   Python                    N        Y
> > > > >   NumPy                     N        Y
> > > > >   Hadoop                    Y        Y
> > > > >   Text Mining               Y        N
> > > > >   Scala/Spark Bindings      Y        N/A
> > > > >   Scalability               Y        Y
> > > > >
> > > >
> > > > Mahout doesn't actually have strong features for clustering,
> > > > classification, and regression. Mahout is very strong in
> > > > recommendations (which you don't mention) and dimensionality
> > > > reduction.
> > > >
> > > > Mahout does support Scala in the development version.
> > > >
> > > > What do you mean by support for NumPy?
> > > >
> > >
> >
>

Re: Mahout Vs Spark

Posted by thejas prasad <th...@gmail.com>.
Ted, I am not too sure, but this FAQ, https://spark.apache.org/faq.html,
suggests otherwise, I think:

Q: Does Spark require modified versions of Scala or Python?

No. Spark requires no changes to Scala or compiler plugins. The Python API
uses the standard CPython implementation, and can call into existing C
libraries for Python such as NumPy.



On Thu, Oct 23, 2014 at 1:11 PM, Ted Dunning <te...@gmail.com> wrote:

> Hmmm....
>
> I don't think that the array formats used by Spark are compatible with the
> formats used by numpy.
>
> I could be wrong, but even if there isn't outright incompatibility, there
> is likely to be some significant overhead in format conversion.

Re: Mahout Vs Spark

Posted by Ted Dunning <te...@gmail.com>.
Hmmm....

I don't think that the array formats used by Spark are compatible with the
formats used by numpy.

I could be wrong, but even if there isn't outright incompatibility, there
is likely to be some significant overhead in format conversion.
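The conversion overhead can be made concrete without either library. In this purely illustrative plain-Python sketch (no Spark, no NumPy), data lives either as record-oriented rows, the shape an RDD partition naturally has, or as flat per-column arrays, the shape an array library wants. Moving between the two is a full copy in each direction, and that copy is the cost even when the formats are not outright incompatible.

```python
# Row-oriented records vs. flat column-major arrays: round-tripping
# between them works, but requires a complete copy each way.
from array import array

rows = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]   # record-oriented

def rows_to_columns(rows):
    """Copy row tuples into one flat double-array per column."""
    return [array('d', col) for col in zip(*rows)]

def columns_to_rows(cols):
    """Copy column arrays back into row tuples."""
    return list(zip(*cols))

cols = rows_to_columns(rows)
print([list(c) for c in cols])   # [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
assert columns_to_rows(cols) == rows   # round-trips, at the cost of two copies
```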


On Tue, Oct 21, 2014 at 6:12 PM, Vibhanshu Prasad <vi...@gmail.com>
wrote:

> Actually, Spark is available in Python also, so users of Spark have an
> upper hand over users of traditional Mahout. This is applicable to
> all the libraries of Python (including NumPy).

Re: Mahout Vs Spark

Posted by Nil Kulkarni <ni...@yahoo.com.INVALID>.
Personally, the most exciting additions coming to Mahout are the Scala and
Spark bindings that bring linear algebra semantics. Far too often, a canned
algorithm never works as expected, and as data scientists we want to try
out a bunch of algorithmic tweaks, which requires getting into the code.
With these bindings, we can try out more researchy stuff and have it
production ready.

Nilesh

     On Tuesday, October 21, 2014 8:01 PM, Lee S <sl...@gmail.com> wrote:
   

As a developer facing the choice of library between Mahout and MLlib, I
have some thoughts below.
Mahout has no decision tree algorithm, but MLlib has the components for
constructing one, such as the Gini index and information gain. I also
think Mahout could add frequent pattern mining algorithms, which are very
important in feature selection and statistical analysis. MLlib has no
frequent pattern mining algorithms.
P.S. Why was the FP-Growth algorithm removed in version 0.9?
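The decision tree "components" mentioned above are small enough to sketch directly. This is generic textbook code in plain Python, not MLlib's actual implementation: Gini impurity, entropy, and the information gain of a split are each a few lines.

```python
# Textbook impurity measures used when growing a decision tree.
from math import log2

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(parent, splits):
    """Entropy reduction from splitting `parent` into `splits`."""
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in splits)

labels = ['a', 'a', 'b', 'b']
print(gini(labels))                                        # 0.5
print(information_gain(labels, [['a', 'a'], ['b', 'b']]))  # 1.0
```

A pure split (each child contains a single class) yields the maximum gain, as in the example above.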


Re: Mahout Vs Spark

Posted by Mahesh Balija <ba...@gmail.com>.
Hi Dmitriy,

My apologies if I have conveyed my questions incorrectly.
My intention is definitely NOT to argue.

I have experience with Mahout, and I am also working on some content to
simplify Mahout, which is why I needed these clarifications. I am also
evaluating both frameworks, and just wanted some input from the active
contributors.

Best!
Mahesh Balija.

On Wed, Oct 22, 2014 at 6:57 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> For the record, this is all false dilemma (at least w.r.t. spark vs mahout
> spark bindings).
>
> The spark bindings have never been concieved as one vs another.
>
> Mahout scala bindings is on-top add-on to spark that just happens to rely
> on some of things in mahout-math.
>
> With spark one gets some major things being RDDS, mllib, spark QL and
> GraphX.
>
> Guess what, in Spark bindings one still gets all of those wonderful things,
> plus the bindings and bindings shell.
>
> Most add-on values in spark bindings are R-like notation for the algebra
> and distributed algebraic optimizer.  Of course there are all those
> wonderful distributed decompositions and pca things, naive base and i think
> some of co-occurrence stuff too. (implicit ALS work for spark was never
> committed, sadly, available on a PR branch only). internally my company
> have built several x more methodology code on spark bindings than spark
> binding has on its own.
>
> Spark bindings are also 100% scala. The only thing that is non-scala (at
> runtime)  is the in-memory Colt-derived matrix model, which is adapted to
> r-like dsl with scala bindings. Oh well. can't have it all.
>
> Bottom line,  for most part I feel you are building a straw man argument
> here. You presenting a problem as being a constrained choice with
> inevitable loss, whereas there has never been a loss of a choice. Even for
> the sake of algebraic decompositions and optimizations i feel there's a
> significant added value. (of course again this is only relevant to bindings
> stuff, not the 0.9 MR stuff all of which is now deprecated).
>
> The only two problems I see is that (1) Mahout takes in too much legacy
> dependencies that are hard to sort thru if one is using it strictly in
> spark base apps. Too many things to sort thru and throw away in that tree.
> I actually use an opt-in approach (that is, i remove all transitive
> dependencies by default and only add them one-by-one if there's actual
> runtime dependency). This is something that could, and should be improved
> incrementally.
>
> Second design problem is that Mahout may be a bit of a problem for using
> alongside other on-top-of-spark systems because it takes over some things
> in Spark (e.g. it requires things to work with kryo). But this is more of
> the Spark limitation itself.
>
>
> But speaking of "survival" and "popularity" concerns, which are very valid
> themselves, I think the major problem with Mahout is none of these alleged
> vs things. Strictly IMO it is that being a ML project, unlike all those
> other wonderful things, it is not widely backed by any major university or
> academic community.  It has never been. And at this point it would seem it
> will never be. As such, unlike with some other projects, there is no
> perpetual source of ambitious researchers to contribute. And original
> founders long since posted their last significant contribution.
>
> On Wed, Oct 22, 2014 at 9:20 AM, Mahesh Balija <balijamahesh.mca@gmail.com
> >
> wrote:
>
> > Hi Team,
> >
> > Thanks for your replies. Even if you consider the strong implementations
> > of recommendations and SVD in Mahout, I would still say that Spark 1.1.0
> > already has support for collaborative filtering (alternating least
> > squares, ALS) and, under dimensionality reduction, SVD and PCA. With
> > fast-paced contributions, I believe Spark may NOT be far from having new
> > and stable algorithms added to it (like ANN, HMM, etc., and support for
> > scientific libraries).
> >
> > Ted, even though the Mahout (1.0) development code base supports Scala
> > and Spark bindings externally, Spark has inbuilt support for Scala (as
> > it is developed in Scala). And NumPy is a Python-based scientific
> > library which needs to be used to support the Python-based MLlib in
> > Spark. The benefit is that Python is also supported in Spark for Python
> > users.
> >
> > The major uniqueness of Mahout is that, as Mahout is inherited from
> > Lucene, it has built-in support for text processing. Of course I do NOT
> > believe it is a strong point, as I assume that developers who know
> > Lucene can easily use it with Spark through the Java interface.
> >
> > Mahout has currently stopped support for Hadoop (i.e., for further
> > libraries); on the other hand, Spark can easily re-use data present in
> > Hadoop/HBase (though maybe NOT the MapReduce functionality, as Spark has
> > its own computation layer).
> >
> > *As a long-time user of Mahout I strongly support it (despite its poor
> > visualization capabilities). At the same time, I am trying to
> > understand: if Spark continues to evolve the MLlib package, with support
> > for in-memory computation, rich scientific libraries through Scala, and
> > support for languages like Java/Scala/Python, will the survival of
> > Mahout be questionable?*
> >
> > Best!
> > Mahesh Balija.
> >
> >
> >
> > On Wed, Oct 22, 2014 at 1:26 PM, Martin, Nick <Ni...@pssd.com> wrote:
> >
> > > I know we lost the maintainer for fpgrowth somewhere along the line but
> > > it's definitely something I'd love to see carried forward, too.
> > >
> > > Sent from my iPhone
> > >
> > > > On Oct 22, 2014, at 8:09 AM, "Brian Dolan" <bu...@gmail.com>
> > wrote:
> > > >
> > > > Sing it, brother!  I miss FP Growth as well.  Once the Scala bindings
> > > are in, I'm hoping to work up some time series methods.


Re: Mahout Vs Spark

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
For the record, this is all a false dilemma (at least w.r.t. Spark vs. the
Mahout Spark bindings).

The Spark bindings were never conceived as one vs. the other.

The Mahout Scala bindings are an add-on on top of Spark that just happens
to rely on some things in mahout-math.

With Spark one gets some major things: RDDs, MLlib, Spark SQL, and GraphX.

Guess what, with the Spark bindings one still gets all of those wonderful
things, plus the bindings and the bindings shell.

The main added value in the Spark bindings is an R-like notation for the
algebra and a distributed algebraic optimizer.  Of course there are all
those wonderful distributed decompositions and PCA things, naive Bayes,
and I think some of the co-occurrence stuff too. (The implicit ALS work
for Spark was never committed, sadly; it is available on a PR branch
only.) Internally my company has built several times more methodology code
on the Spark bindings than the bindings have on their own.
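The "R-like notation plus algebraic optimizer" idea can be sketched in miniature. This is a purely illustrative plain-Python toy (the real Samsara DSL is Scala, and every class and method name below is invented): expressions such as A.t * A are built lazily as a tree, and an optimizer rule recognizes the transpose-times-self pattern and computes the Gramian in one pass over A's rows instead of materializing the transpose first.

```python
# Toy lazy-algebra DSL with one optimizer rewrite rule, for illustration.

def matmul(a, b):
    """Naive dense matrix multiply on lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

class Expr:
    def t(self):
        return Transpose(self)
    def __mul__(self, other):          # stands in for %*%
        return MatMul(self, other)

class Mat(Expr):
    def __init__(self, data):
        self.data = data
    def eval(self):
        return self.data

class Transpose(Expr):
    def __init__(self, child):
        self.child = child
    def eval(self):
        return transpose(self.child.eval())

class MatMul(Expr):
    def __init__(self, left, right):
        self.left, self.right = left, right
    def eval(self):
        # Optimizer rule: A.t * A needs only one pass over A's rows,
        # accumulating outer products, instead of materializing A.t.
        if isinstance(self.left, Transpose) and self.left.child is self.right:
            a = self.right.eval()
            n = len(a[0])
            acc = [[0.0] * n for _ in range(n)]
            for row in a:
                for i in range(n):
                    for j in range(n):
                        acc[i][j] += row[i] * row[j]
            return acc
        return matmul(self.left.eval(), self.right.eval())

A = Mat([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print((A.t() * A).eval())   # [[35.0, 44.0], [44.0, 56.0]], via the fused rule
```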

The Spark bindings are also 100% Scala. The only thing that is non-Scala
(at runtime) is the in-memory Colt-derived matrix model, which is adapted
to an R-like DSL with the Scala bindings. Oh well, can't have it all.

Bottom line: for the most part I feel you are building a straw man
argument here. You are presenting the problem as a constrained choice with
inevitable loss, whereas there has never been a loss of choice. Even for
the sake of the algebraic decompositions and optimizations alone, I feel
there is significant added value. (Of course, again, this is only relevant
to the bindings stuff, not the 0.9 MR stuff, all of which is now
deprecated.)

The only two problems I see are that (1) Mahout pulls in too many legacy
dependencies, which are hard to sort through if one is using it strictly
in Spark-based apps. There are too many things to sort through and throw
away in that tree. I actually use an opt-in approach (that is, I remove
all transitive dependencies by default and only add them back one by one
if there is an actual runtime dependency). This is something that could,
and should, be improved incrementally.

(2) The second design problem is that Mahout may be a bit of a problem to
use alongside other on-top-of-Spark systems, because it takes over some
things in Spark (e.g. it requires things to work with Kryo). But this is
more of a limitation of Spark itself.


But speaking of the "survival" and "popularity" concerns, which are very
valid in themselves, I think the major problem with Mahout is none of
these alleged "vs." things. Strictly IMO, it is that, being an ML project,
unlike all those other wonderful things, it is not widely backed by any
major university or academic community.  It never has been, and at this
point it would seem it never will be. As such, unlike some other projects,
there is no perpetual source of ambitious researchers to contribute. And
the original founders long since posted their last significant
contributions.


Re: Mahout Vs Spark

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
For the record, this is all a false dilemma (at least w.r.t. Spark vs. the Mahout
Spark bindings).

The Spark bindings were never conceived as one vs. the other.

The Mahout Scala bindings are an on-top add-on to Spark that just happens to rely
on some things in mahout-math.

With Spark one gets some major things: RDDs, MLlib, Spark SQL and
GraphX.

Guess what: with the Spark bindings one still gets all of those wonderful things,
plus the bindings and the bindings shell.

The main added value of the Spark bindings is the R-like notation for algebra
and the distributed algebraic optimizer. Of course there are also all those
wonderful distributed decompositions and PCA things, naive Bayes and I think
some of the co-occurrence stuff too. (The implicit ALS work for Spark was never
committed, sadly; it is available on a PR branch only.) Internally my company
has built several times more methodology code on the Spark bindings than the
bindings have on their own.
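
As a toy sketch of the co-occurrence idea behind Mahout's item-based
recommenders (all data here is invented, and real Mahout computes this as a
distributed matrix product, not Python dicts):

```python
from collections import defaultdict
from itertools import combinations

# Toy user -> items histories (invented).
interactions = {
    "alice": {"a", "b", "c"},
    "bob":   {"a", "b"},
    "carol": {"b", "c", "d"},
}

# Count how often each ordered pair of items shows up in the same history.
cooc = defaultdict(int)
for items in interactions.values():
    for i, j in combinations(sorted(items), 2):
        cooc[(i, j)] += 1
        cooc[(j, i)] += 1

def recommend(item, k=2):
    """Items that most often co-occur with `item`, strongest first."""
    scores = {j: n for (i, j), n in cooc.items() if i == item}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("a"))  # ['b', 'c']: 'b' co-occurs with 'a' twice, 'c' once
```

The co-occurrence counts are exactly the entries of A'A for a user-by-item
indicator matrix A, which is why the R-like algebra notation fits this problem
so naturally.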

The Spark bindings are also 100% Scala. The only thing that is non-Scala (at
runtime) is the in-memory Colt-derived matrix model, which is adapted to the
R-like DSL by the Scala bindings. Oh well, can't have it all.

Bottom line: for the most part I feel you are building a straw-man argument
here. You are presenting the problem as a constrained choice with inevitable
loss, whereas there has never been a loss of choice. Even just for the
algebraic decompositions and optimizations I feel there's significant added
value. (Of course, again, this is only relevant to the bindings stuff, not the
0.9 MR stuff, all of which is now deprecated.)

The only two problems I see are that (1) Mahout takes in too many legacy
dependencies that are hard to sort through if one is using it strictly in
Spark-based apps. There are too many things to sort through and throw away in
that tree. I actually use an opt-in approach (that is, I remove all transitive
dependencies by default and only add them back one by one if there's an actual
runtime dependency). This is something that could, and should, be improved
incrementally.

(2) The second design problem is that Mahout may be a bit of a problem to use
alongside other on-top-of-Spark systems, because it takes over some things
in Spark (e.g. it requires things to work with Kryo). But this is more of
a limitation of Spark itself.


But speaking of "survival" and "popularity" concerns, which are very valid in
themselves, I think the major problem with Mahout is none of these alleged
"vs." things. Strictly IMO, it is that, being an ML project, unlike all those
other wonderful things it is not widely backed by any major university or
academic community. It never has been, and at this point it would seem it
never will be. As such, unlike some other projects, there is no
perpetual source of ambitious researchers to contribute. And the original
founders have long since posted their last significant contribution.

On Wed, Oct 22, 2014 at 9:20 AM, Mahesh Balija <ba...@gmail.com>
wrote:

> Hi Team,
>
> Thanks for your replies, even if you consider the strong implementation of
> Recommendations and SVD in Mahout, I would still say that even in Spark
> 1.1.0 there is support for collaborative filtering (alternating least
> squares (ALS)) and under dimensionality reduction SVD and PCA. With fast
> pace contributions, I believe Spark may NOT be far away to have new and
> stable algorithms added to it (Like ANN, HMM etc and support for scientific
> libraries).
>
> Ted, Even though Mahout (1.0) development code base support Scala and Spark
> bindings externally, Spark has this inbuilt support for Scala (as its been
> developed in Scala). And Numpy is a python based scientific library which
> need to be used for the support of Python based MLlib in Spark. Benefits
> are python is also supported in Spark for Python users.
>
> Major uniqueness of Mahout is, as Mahout is inherited from Lucene it has
> built-in support for Text processing. Ofcourse I do NOT believe its a
> strong point as I assume that, developers knowing Lucene can be able to
> easily use it with Spark through Java interface.
>
> Mahout currently stopped support for Hadoop (i.e., for further libraries)
> on the other hand Spark can re-use the data present in Hadoop/Hbase easily
> (May NOT be mapreduce functionality as Spark has its own computation
> layer).
>
> *As a user of Mahout since long time I strongly support Mahout (despite of
> poor visualization capabilities), at the same time, I am trying to
> understand if Spark continues to be evolved in MLLib package and being
> support for in-memory computation and with rich scientific libraries
> through Scala and support for languages like Java/Scala/Python will the
> survival of Mahout be questionable?*
>
> Best!
> Mahesh Balija.
>
>
>
> On Wed, Oct 22, 2014 at 1:26 PM, Martin, Nick <Ni...@pssd.com> wrote:
>
> > I know we lost the maintainer for fpgrowth somewhere along the line but
> > it's definitely something I'd love to see carried forward, too.
> >
> > Sent from my iPhone
> >
> > > On Oct 22, 2014, at 8:09 AM, "Brian Dolan" <bu...@gmail.com>
> wrote:
> > >
> > > Sing it, brother!  I miss FP Growth as well.  Once the Scala bindings
> > are in, I'm hoping to work up some time series methods.
> > >
> > >> On Oct 21, 2014, at 8:00 PM, Lee S <sl...@gmail.com> wrote:
> > >>
> > >> As a developer, who is facing the library  chosen between mahout and
> > mllib,
> > >> I have some idea below.
> > >> Mahout has no any decision tree algorithm. But MLLIB has the
> components
> > of
> > >> constructing a decision tree algorithm such as gini index, information
> > >> gain. And also  I think mahout can add algorithm about frequency
> pattern
> > >> mining which is very import in feature selection and statistic
> analysis.
> > >> MLLIB has no frequent mining algorithms.
> > >> p.s Why fpgrowth algorithm is removed in version 0.9?
> > >>
> > >> 2014-10-22 9:12 GMT+08:00 Vibhanshu Prasad <vibhanshugsoc2@gmail.com
> >:
> > >>
> > >>> actually spark is available in python also, so users of spark are
> > having an
> > >>> upper hand over users of traditional users of mahout. This is
> > applicable to
> > >>> all the libraries of python (including numpy).
> > >>>
> > >>> On Wed, Oct 22, 2014 at 3:54 AM, Ted Dunning <te...@gmail.com>
> > >>> wrote:
> > >>>
> > >>>> On Tue, Oct 21, 2014 at 3:04 PM, Mahesh Balija <
> > >>> balijamahesh.mca@gmail.com
> > >>>> wrote:
> > >>>>
> > >>>>> I am trying to differentiate between Mahout and Spark, here is the
> > >>> small
> > >>>>> list,
> > >>>>>
> > >>>>> Features Mahout Spark  Clustering Y Y  Classification Y Y
> > >>> Regression Y
> > >>>>> Y  Dimensionality Reduction Y Y  Java Y Y  Scala N Y  Python N Y
> > >>> Numpy N
> > >>>>> Y  Hadoop Y Y  Text Mining Y N  Scala/Spark Bindings Y N/A
> > >>> scalability Y
> > >>>>> Y
> > >>>>
> > >>>> Mahout doesn't actually have strong features for clustering,
> > >>> classification
> > >>>> and regression. Mahout is very strong in recommendations (which you
> > don't
> > >>>> mention) and dimensionality reduction.
> > >>>>
> > >>>> Mahout does support scala in the development version.
> > >>>>
> > >>>> What do you mean by support for Numpy?
> > >
> >
>

Re: Mahout Vs Spark

Posted by Mahesh Balija <ba...@gmail.com>.
Hi Team,

Thanks for your replies. Even if you consider the strong implementations of
recommendations and SVD in Mahout, I would still say that Spark 1.1.0 already
supports collaborative filtering (alternating least squares (ALS)) and, under
dimensionality reduction, SVD and PCA. With fast-paced contributions, I believe
Spark may NOT be far from having new and stable algorithms added to it (like
ANN, HMM etc., and support for scientific libraries).
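
For intuition on the dimensionality-reduction overlap mentioned above: the
first principal component PCA finds is just the dominant eigenvector of the
covariance matrix, which power iteration recovers with no libraries at all.
A toy 2-D sketch (invented data, not MLlib's implementation):

```python
import math

# Toy 2-D data lying roughly along the line y = x (invented numbers).
points = [(2.0, 1.9), (1.0, 1.1), (3.0, 3.2), (4.0, 3.8), (0.0, 0.1)]

n = len(points)
mx = sum(x for x, _ in points) / n
my = sum(y for _, y in points) / n
centered = [(x - mx, y - my) for x, y in points]

# Entries of the 2x2 covariance matrix.
cxx = sum(x * x for x, _ in centered) / n
cyy = sum(y * y for _, y in centered) / n
cxy = sum(x * y for x, y in centered) / n

# Power iteration: repeatedly multiply by the matrix and renormalize.
vx, vy = 1.0, 0.0
for _ in range(100):
    wx = cxx * vx + cxy * vy
    wy = cxy * vx + cyy * vy
    norm = math.hypot(wx, wy)
    vx, vy = wx / norm, wy / norm

print(round(vx, 2), round(vy, 2))  # roughly the y = x direction
```

The distributed versions in Mahout and MLlib do the same linear algebra, just
over matrices partitioned across the cluster.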

Ted, even though the Mahout (1.0) development code base supports Scala and
Spark bindings externally, Spark has inbuilt support for Scala (as it has been
developed in Scala). And NumPy is a Python-based scientific library which is
needed to support the Python-based MLlib in Spark. The benefit is that Python
is also supported in Spark for Python users.

A major point of uniqueness of Mahout is that, as Mahout grew out of Lucene,
it has built-in support for text processing. Of course I do NOT believe it's a
strong point, as I assume developers who know Lucene can easily use it with
Spark through the Java interface.
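
As a rough sketch of the kind of text vectorization meant here (documents are
invented; Mahout's real pipeline runs Lucene analyzers and jobs like
seq2sparse over much larger corpora):

```python
import math
from collections import Counter

# A miniature TF-IDF on three invented documents.
docs = [
    "the elephant never forgets",
    "spark processes data fast",
    "the elephant and the spark",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many docs each term appears.
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc):
    """Term weights for one tokenized document."""
    tf = Counter(doc)
    return {t: (tf[t] / len(doc)) * math.log(n_docs / df[t]) for t in tf}

weights = tfidf(tokenized[0])
# "never" occurs in 1 of 3 docs, "elephant" in 2, so "never" weighs more.
print(round(weights["never"], 3), round(weights["elephant"], 3))
```
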

Mahout has currently stopped support for Hadoop (i.e., for further libraries);
on the other hand, Spark can easily reuse data present in Hadoop/HBase (maybe
NOT MapReduce functionality, as Spark has its own computation layer).

*As a long-time user of Mahout I strongly support it (despite its poor
visualization capabilities). At the same time, I am trying to understand:
if Spark continues to evolve its MLlib package, with its support for in-memory
computation, rich scientific libraries through Scala, and support for
languages like Java/Scala/Python, will the survival of Mahout be
questionable?*

Best!
Mahesh Balija.



On Wed, Oct 22, 2014 at 1:26 PM, Martin, Nick <Ni...@pssd.com> wrote:

> I know we lost the maintainer for fpgrowth somewhere along the line but
> it's definitely something I'd love to see carried forward, too.
>
> Sent from my iPhone
>
> > On Oct 22, 2014, at 8:09 AM, "Brian Dolan" <bu...@gmail.com> wrote:
> >
> > Sing it, brother!  I miss FP Growth as well.  Once the Scala bindings
> are in, I'm hoping to work up some time series methods.
> >
> >> On Oct 21, 2014, at 8:00 PM, Lee S <sl...@gmail.com> wrote:
> >>
> >> As a developer, who is facing the library  chosen between mahout and
> mllib,
> >> I have some idea below.
> >> Mahout has no any decision tree algorithm. But MLLIB has the components
> of
> >> constructing a decision tree algorithm such as gini index, information
> >> gain. And also  I think mahout can add algorithm about frequency pattern
> >> mining which is very import in feature selection and statistic analysis.
> >> MLLIB has no frequent mining algorithms.
> >> p.s Why fpgrowth algorithm is removed in version 0.9?
> >>
> >> 2014-10-22 9:12 GMT+08:00 Vibhanshu Prasad <vi...@gmail.com>:
> >>
> >>> actually spark is available in python also, so users of spark are
> having an
> >>> upper hand over users of traditional users of mahout. This is
> applicable to
> >>> all the libraries of python (including numpy).
> >>>
> >>> On Wed, Oct 22, 2014 at 3:54 AM, Ted Dunning <te...@gmail.com>
> >>> wrote:
> >>>
> >>>> On Tue, Oct 21, 2014 at 3:04 PM, Mahesh Balija <
> >>> balijamahesh.mca@gmail.com
> >>>> wrote:
> >>>>
> >>>>> I am trying to differentiate between Mahout and Spark, here is the
> >>> small
> >>>>> list,
> >>>>>
> >>>>> Features Mahout Spark  Clustering Y Y  Classification Y Y
> >>> Regression Y
> >>>>> Y  Dimensionality Reduction Y Y  Java Y Y  Scala N Y  Python N Y
> >>> Numpy N
> >>>>> Y  Hadoop Y Y  Text Mining Y N  Scala/Spark Bindings Y N/A
> >>> scalability Y
> >>>>> Y
> >>>>
> >>>> Mahout doesn't actually have strong features for clustering,
> >>> classification
> >>>> and regression. Mahout is very strong in recommendations (which you
> don't
> >>>> mention) and dimensionality reduction.
> >>>>
> >>>> Mahout does support scala in the development version.
> >>>>
> >>>> What do you mean by support for Numpy?
> >
>

Re: Mahout Vs Spark

Posted by "Martin, Nick" <Ni...@pssd.com>.
I know we lost the maintainer for fpgrowth somewhere along the line but it's definitely something I'd love to see carried forward, too.

Sent from my iPhone

> On Oct 22, 2014, at 8:09 AM, "Brian Dolan" <bu...@gmail.com> wrote:
> 
> Sing it, brother!  I miss FP Growth as well.  Once the Scala bindings are in, I'm hoping to work up some time series methods.
> 
>> On Oct 21, 2014, at 8:00 PM, Lee S <sl...@gmail.com> wrote:
>> 
>> As a developer, who is facing the library  chosen between mahout and mllib,
>> I have some idea below.
>> Mahout has no any decision tree algorithm. But MLLIB has the components of
>> constructing a decision tree algorithm such as gini index, information
>> gain. And also  I think mahout can add algorithm about frequency pattern
>> mining which is very import in feature selection and statistic analysis.
>> MLLIB has no frequent mining algorithms.
>> p.s Why fpgrowth algorithm is removed in version 0.9?
>> 
>> 2014-10-22 9:12 GMT+08:00 Vibhanshu Prasad <vi...@gmail.com>:
>> 
>>> actually spark is available in python also, so users of spark are having an
>>> upper hand over users of traditional users of mahout. This is applicable to
>>> all the libraries of python (including numpy).
>>> 
>>> On Wed, Oct 22, 2014 at 3:54 AM, Ted Dunning <te...@gmail.com>
>>> wrote:
>>> 
>>>> On Tue, Oct 21, 2014 at 3:04 PM, Mahesh Balija <
>>> balijamahesh.mca@gmail.com
>>>> wrote:
>>>> 
>>>>> I am trying to differentiate between Mahout and Spark, here is the
>>> small
>>>>> list,
>>>>> 
>>>>> Features Mahout Spark  Clustering Y Y  Classification Y Y
>>> Regression Y
>>>>> Y  Dimensionality Reduction Y Y  Java Y Y  Scala N Y  Python N Y
>>> Numpy N
>>>>> Y  Hadoop Y Y  Text Mining Y N  Scala/Spark Bindings Y N/A
>>> scalability Y
>>>>> Y
>>>> 
>>>> Mahout doesn't actually have strong features for clustering,
>>> classification
>>>> and regression. Mahout is very strong in recommendations (which you don't
>>>> mention) and dimensionality reduction.
>>>> 
>>>> Mahout does support scala in the development version.
>>>> 
>>>> What do you mean by support for Numpy?
> 

Re: Mahout Vs Spark

Posted by Brian Dolan <bu...@gmail.com>.
Sing it, brother!  I miss FP Growth as well.  Once the Scala bindings are in, I'm hoping to work up some time series methods.
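
For readers unfamiliar with what FP-Growth produces: frequent pattern mining
reports itemsets that co-occur in at least some minimum number of
transactions. A brute-force sketch of that output (invented transactions;
real FP-Growth builds an FP-tree instead of enumerating candidates like
this):

```python
from collections import Counter
from itertools import combinations

# Toy market-basket transactions (invented).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

def frequent_itemsets(transactions, min_support=2, max_size=2):
    """All itemsets up to max_size that occur in >= min_support transactions."""
    counts = Counter()
    for t in transactions:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(t), size):
                counts[combo] += 1
    return {s: c for s, c in counts.items() if c >= min_support}

for itemset, count in sorted(frequent_itemsets(transactions).items()):
    print(itemset, count)
```
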

On Oct 21, 2014, at 8:00 PM, Lee S <sl...@gmail.com> wrote:

> As a developer, who is facing the library  chosen between mahout and mllib,
> I have some idea below.
> Mahout has no any decision tree algorithm. But MLLIB has the components of
> constructing a decision tree algorithm such as gini index, information
> gain. And also  I think mahout can add algorithm about frequency pattern
> mining which is very import in feature selection and statistic analysis.
> MLLIB has no frequent mining algorithms.
> p.s Why fpgrowth algorithm is removed in version 0.9?
> 
> 2014-10-22 9:12 GMT+08:00 Vibhanshu Prasad <vi...@gmail.com>:
> 
>> actually spark is available in python also, so users of spark are having an
>> upper hand over users of traditional users of mahout. This is applicable to
>> all the libraries of python (including numpy).
>> 
>> On Wed, Oct 22, 2014 at 3:54 AM, Ted Dunning <te...@gmail.com>
>> wrote:
>> 
>>> On Tue, Oct 21, 2014 at 3:04 PM, Mahesh Balija <
>> balijamahesh.mca@gmail.com
>>>> 
>>> wrote:
>>> 
>>>> I am trying to differentiate between Mahout and Spark, here is the
>> small
>>>> list,
>>>> 
>>>>  Features Mahout Spark  Clustering Y Y  Classification Y Y
>> Regression Y
>>>> Y  Dimensionality Reduction Y Y  Java Y Y  Scala N Y  Python N Y
>> Numpy N
>>>> Y  Hadoop Y Y  Text Mining Y N  Scala/Spark Bindings Y N/A
>> scalability Y
>>>> Y
>>>> 
>>> 
>>> Mahout doesn't actually have strong features for clustering,
>> classification
>>> and regression. Mahout is very strong in recommendations (which you don't
>>> mention) and dimensionality reduction.
>>> 
>>> Mahout does support scala in the development version.
>>> 
>>> What do you mean by support for Numpy?
>>> 
>> 


Re: Mahout Vs Spark

Posted by Lee S <sl...@gmail.com>.
As a developer facing the choice between Mahout and MLlib, I have some
thoughts below.
Mahout has no decision tree algorithm, but MLlib has the components for
constructing one, such as the Gini index and information gain. I also think
Mahout could add algorithms for frequent pattern mining, which is very
important in feature selection and statistical analysis. MLlib has no
frequent pattern mining algorithms.
P.S. Why was the fpgrowth algorithm removed in version 0.9?
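
For the curious, those decision-tree building blocks are short formulas; a
minimal pure-Python sketch (toy labels, not MLlib's implementation):

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting `parent` into two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]
left, right = ["yes", "yes"], ["no", "no"]   # a perfect split
print(gini(parent))                           # 0.5 for a 50/50 class mix
print(information_gain(parent, left, right))  # 1.0 bit: all uncertainty removed
```

A tree learner just evaluates these scores over candidate splits and keeps
the best one at each node.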

2014-10-22 9:12 GMT+08:00 Vibhanshu Prasad <vi...@gmail.com>:

> actually spark is available in python also, so users of spark are having an
> upper hand over users of traditional users of mahout. This is applicable to
> all the libraries of python (including numpy).
>
> On Wed, Oct 22, 2014 at 3:54 AM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > On Tue, Oct 21, 2014 at 3:04 PM, Mahesh Balija <
> balijamahesh.mca@gmail.com
> > >
> > wrote:
> >
> > > I am trying to differentiate between Mahout and Spark, here is the
> small
> > > list,
> > >
> > >   Features Mahout Spark  Clustering Y Y  Classification Y Y
> Regression Y
> > > Y  Dimensionality Reduction Y Y  Java Y Y  Scala N Y  Python N Y
> Numpy N
> > > Y  Hadoop Y Y  Text Mining Y N  Scala/Spark Bindings Y N/A
> scalability Y
> > > Y
> > >
> >
> > Mahout doesn't actually have strong features for clustering,
> classification
> > and regression. Mahout is very strong in recommendations (which you don't
> > mention) and dimensionality reduction.
> >
> > Mahout does support scala in the development version.
> >
> > What do you mean by support for Numpy?
> >
>

Re: Mahout Vs Spark

Posted by Vibhanshu Prasad <vi...@gmail.com>.
Actually, Spark is available in Python also, so Spark users have an upper hand
over traditional Mahout users. This applies to all Python libraries
(including NumPy).

On Wed, Oct 22, 2014 at 3:54 AM, Ted Dunning <te...@gmail.com> wrote:

> On Tue, Oct 21, 2014 at 3:04 PM, Mahesh Balija <balijamahesh.mca@gmail.com
> >
> wrote:
>
> > I am trying to differentiate between Mahout and Spark, here is the small
> > list,
> >
> >   Features Mahout Spark  Clustering Y Y  Classification Y Y  Regression Y
> > Y  Dimensionality Reduction Y Y  Java Y Y  Scala N Y  Python N Y  Numpy N
> > Y  Hadoop Y Y  Text Mining Y N  Scala/Spark Bindings Y N/A  scalability Y
> > Y
> >
>
> Mahout doesn't actually have strong features for clustering, classification
> and regression. Mahout is very strong in recommendations (which you don't
> mention) and dimensionality reduction.
>
> Mahout does support scala in the development version.
>
> What do you mean by support for Numpy?
>

Re: Mahout Vs Spark

Posted by Ted Dunning <te...@gmail.com>.
On Tue, Oct 21, 2014 at 3:04 PM, Mahesh Balija <ba...@gmail.com>
wrote:

> I am trying to differentiate between Mahout and Spark, here is the small
> list,
>
>   Features Mahout Spark  Clustering Y Y  Classification Y Y  Regression Y
> Y  Dimensionality Reduction Y Y  Java Y Y  Scala N Y  Python N Y  Numpy N
> Y  Hadoop Y Y  Text Mining Y N  Scala/Spark Bindings Y N/A  scalability Y
> Y
>

Mahout doesn't actually have strong features for clustering, classification
and regression. Mahout is very strong in recommendations (which you don't
mention) and dimensionality reduction.

Mahout does support Scala in the development version.

What do you mean by support for Numpy?
