Posted to dev@mahout.apache.org by Andrew Palumbo <ap...@outlook.com> on 2016/07/21 02:35:00 UTC

Traits for a mahout algorithm Library.

Hi All,


I'd like to draw your attention to MAHOUT-1856:  https://issues.apache.org/jira/browse/MAHOUT-1856


This discussion has come up several times over the last couple of years. As we move towards building out our algorithm library, it would be great to nail this down now.


Most importantly, we want to avoid being criticized as "a loose bag of algorithms", as we have sometimes been in the past.


The main point is that it would be good to lay out common traits for Classification, Clustering, and Optimization algorithms.


This is just a start. I created this issue a few months back and intentionally left off Recommender, because I was unsure whether recommenders share common traits.  By traits, I am referring to both the literal meaning and, more specifically, actual Scala traits.


@pat, @tdunning, @ssc, could you give your thoughts on this?


As well, it would be good to add online flavors of different algorithm classes into the mix.


@tdunning could you share some thoughts here?


Trevor Grant will be heading up this effort, and it would be great if we all, as a team, could come up with abstract design plans for each class of algorithm (as well as determine the current "classes of algorithms"), since each of us has our own unique blend of specializations and could give our thoughts on this.


Currently this is really the opening of the conversation.


It would be best to post thoughts on: https://issues.apache.org/jira/browse/MAHOUT-1856


Any feedback is welcomed.


Thanks,


Andy



Re: Traits for a mahout algorithm Library.

Posted by Andrew Palumbo <ap...@outlook.com>.
Awesome- Sounds like we've got a good plan going.  A pipeline style sounds good to me.


So as I understand from reading today's discussions (I've been kind of following all day), we really need to figure out an optimal way of keeping Mahout both (primarily) a "roll your own math/algos" platform and a library (both perceived and in reality), and specifically a library into which one could plug one's own previously "rolled" math/algos.


Re: DataFrames, we should be able to set these pipelines up in a way that is abstract enough, i.e. at the math-scala level, so that any engine-agnostic pipelines can be run without DataFrame (or other Spark) dependencies, and then drop down into the Spark module and add DataFrame, etc. capabilities, correct?


So we'd have e.g. o.a.m.library in both, with Spark-specific algos in the spark module and engine-agnostic ones in math-scala, same as we do with everything else.
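
For illustration only, the split described above might be sketched as follows. All names here (`Dataset`, `Fitter`, `MeanFitter`, `InMemoryDataset`) are hypothetical and are not Mahout API; a plain in-memory backend stands in for an engine module such as Spark.

```scala
// Hypothetical engine-agnostic layer (think math-scala): no engine imports.
trait Dataset {
  def rows: Iterator[Array[Double]]
}

trait Fitter[M] {
  def fit(data: Dataset): M
}

// An algorithm written only against the abstraction above:
// it computes the per-column mean of the data.
object MeanFitter extends Fitter[Array[Double]] {
  def fit(data: Dataset): Array[Double] = {
    var sum = Array.empty[Double]
    var n = 0
    for (row <- data.rows) {
      if (sum.isEmpty) sum = new Array[Double](row.length)
      for (j <- row.indices) sum(j) += row(j)
      n += 1
    }
    if (n == 0) sum else sum.map(_ / n)
  }
}

// Hypothetical engine-specific layer (think the spark module): the only
// place that knows how the data is stored. Here it is just a local Seq.
final class InMemoryDataset(data: Seq[Array[Double]]) extends Dataset {
  def rows: Iterator[Array[Double]] = data.iterator
}
```

The algorithm never mentions the backend, so a DataFrame-backed `Dataset` could be added in an engine module without touching `MeanFitter`.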


I guess that's basically what Sebastian was suggesting earlier in the thread.


We can also make use of certain MLlib algos in the Spark module with conversions to/from DRM format, and further push the fact that we are a complement to MLlib rather than competition.


Sorry if I'm just repeating what you guys have hashed out today.


+1 to hyperparameter search that may include feature extraction.









________________________________
From: Dmitriy Lyubimov <dl...@gmail.com>
Sent: Thursday, July 21, 2016 4:47:26 PM
To: dev@mahout.apache.org
Cc: Sebastian Schelter
Subject: Re: Traits for a mahout algorithm Library.

On Thu, Jul 21, 2016 at 12:35 PM, Trevor Grant <tr...@gmail.com>
wrote:

>
>
> Finally, re data-frames.  Why not leave it as vectors and matrices?
>

Short answer: because (imo) data frames are not vectors and matrices.

Longer argument:

Some capabilities expected of data frames are as follows.

DFs are columnar tables where columns are either named vectors or named
factors (in the R sense).

Also, operationally, DFs usually lean more towards providing relational
algebra capabilities (joins etc.) than numerical algebra (BLAS3).

A factor (or, perhaps a better term, a categorical feature) is
fundamentally non-numerical data. It is a representation of categorical
data which could be bounded or unbounded in the number of categories.

Furthermore, there is more than one way to vectorize a factor or a group
of factors, which is what formulas and similar tools are for.
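
To make "more than one way to vectorize a factor" concrete, here is a minimal sketch of two common encodings in plain Scala. The object and method names are made up for illustration and are not Mahout API.

```scala
object FactorEncoding {
  // One-hot (dummy) encoding: one column per known category, in the given
  // order. Values not listed in `levels` map to an all-zero vector.
  def oneHot(levels: Seq[String])(value: String): Array[Double] =
    levels.map(l => if (l == value) 1.0 else 0.0).toArray

  // Hashing trick: a fixed number of columns, usable when the number of
  // categories is unbounded. Distinct categories may share a bucket.
  def hashed(numBuckets: Int)(value: String): Array[Double] = {
    val v = new Array[Double](numBuckets)
    v(Math.floorMod(value.hashCode, numBuckets)) = 1.0
    v
  }
}
```

Both functions take the same factor value and yield different numerical inputs, and both carry a parameter (`levels`, `numBuckets`) that belongs to the vectorization rather than to the learner.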

Now you might view all these formulas, factors and hash tricks as a feature
preparation activity and say that the learning process is not bothered by
that. In the end, every fitting is essentially working on numerical input.

Unfortunately, that may not be quite true.

Model search (step-wise GLM, for example) is not necessarily a
numerical-only thing, since it essentially manages factor vectorization.

That said, I think we can safely say that an individual learner could be a
numerical-only thing. But as soon as we go up the chain to transformations,
vectorizations and searching for parameters of vectorizations, dataframes
are usually the input sources for all of those.

An excellent example of this (one that another OSS project failed to
architect properly by concerns) is the implicit feedback recommender.

In fact, there are two problems here: one is parameterized feature
extraction and the other is fitting the decomposition.

Each of the problems has its own parameters. The vanilla paper
implementation suggested two ways of feature extraction that offered one
parameter each, and suggested searching for those via CV along with the
fitter hyperparameters (learning rate, regularization).

What it means is that hyperparameter search may overarch feature extraction
_and_ fitting, and essentially may require a data frame as an input in the
most general case (and I have run into such a practical case before).
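
A search that spans both stages could be sketched roughly as below. This is a generic grid search under assumed names (`JointSearch`, `gridSearch`, `score`), not an existing Mahout facility. The `score` function is assumed to close over the raw, pre-vectorization data, which is why the search as a whole needs data-frame-like input.

```scala
object JointSearch {
  // Try every (extraction parameter, fitter parameter) pair and keep the
  // pair with the lowest score. `score` stands in for "extract features
  // with a, fit with b, then evaluate on held-out data" (lower is better).
  def gridSearch[A, B](extractParams: Seq[A], fitParams: Seq[B])
                      (score: (A, B) => Double): (A, B, Double) = {
    require(extractParams.nonEmpty && fitParams.nonEmpty, "empty grid")
    val candidates =
      for (a <- extractParams; b <- fitParams) yield (a, b, score(a, b))
    // On ties the earliest candidate wins.
    candidates.reduceLeft((best, c) => if (c._3 < best._3) c else best)
  }
}
```

A call might look like `JointSearch.gridSearch(alphas, lambdas)((alpha, lambda) => cvError(rawData, alpha, lambda))`, where `alphas`, `lambdas`, `cvError` and `rawData` are placeholders the caller would supply.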

Finally, some goodness of fit metrics work on pre-vectorized factors.

This is all standard, but unfortunately it is all pretty expensive to do. I
have a big problem with discarding the notion of dataframe support as part
of the fitting/search process for some areas of computational statistics.

Re: Traits for a mahout algorithm Library.

Posted by Trevor Grant <tr...@gmail.com>.
+1

The sklearn paradigm I think is awesome as an API, but I'm not looking to
make sklearn for Spark.  To Dmitriy's first point (correct me if I'm
extrapolating incorrectly), every underlying engine already has an SGD
regression, K-Means, and a couple of other standbys.  They take no time to
build, but why? If the user wants them, they can use them in the native
engine (or we can slap them in there just cause).

Let's (aim to) differentiate by providing useful algorithms not already
shipped standard in every other ML package on the block.

Another 'algorithm' that is used very widely in every industry I've been in
(Marketing and CPG), and that doesn't have a pleasant 'Big Data' solution, is
hierarchical models (also called mixed models).  There's a bunch of other
'daily drivers' that everyone already uses in R/SAS/etc. that just don't
scale well, thus the rise of SGD and Big Data algos.  Mahout is the ML
library for people who actually know math IMHO, in contrast to others that
are ML for computer scientists.  Let's expose some algorithms that
single-node analysts know and are comfortable with.

So OLS isn't as efficient as SGD... so what.  An analyst can pick up
Mahout and migrate their old methods into a distributed environment.
Further, they can see t-scores, F-scores, chi-squared tests, and all those
statistics that everyone has come to know and love.  I think that would be a
huge win, as it erases this idea that if you're going to work in big data
you must abandon the old ways.
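
As a reminder of what those familiar statistics involve, here is a minimal single-node sketch of one-regressor OLS that also reports the slope's standard error and t-score. It is plain Scala with hypothetical names (`SimpleOls`, `Fit`), not a distributed or Mahout implementation.

```scala
object SimpleOls {
  final case class Fit(intercept: Double, slope: Double,
                       slopeStdErr: Double, slopeT: Double)

  // Ordinary least squares for y = intercept + slope * x.
  // Requires at least 3 observations and a non-constant x.
  def fit(x: Array[Double], y: Array[Double]): Fit = {
    require(x.length == y.length && x.length > 2, "need matching inputs, n > 2")
    val n = x.length
    val xBar = x.sum / n
    val yBar = y.sum / n
    val sxx = x.map(v => (v - xBar) * (v - xBar)).sum
    require(sxx > 0.0, "x must not be constant")
    val sxy = x.zip(y).map { case (a, b) => (a - xBar) * (b - yBar) }.sum
    val slope = sxy / sxx
    val intercept = yBar - slope * xBar
    // Residual sum of squares; dividing by n - 2 gives the usual unbiased
    // estimate of the error variance.
    val sse = x.zip(y).map { case (a, b) =>
      val r = b - (intercept + slope * a)
      r * r
    }.sum
    val stdErr = math.sqrt(sse / (n - 2) / sxx)
    // With a perfect fit stdErr is 0, so the t-score is not finite.
    Fit(intercept, slope, stdErr, slope / stdErr)
  }
}
```

The closed-form fit gives the analyst the standard error and t-score directly, which is the kind of output an SGD fit does not provide by default.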

To Dmitriy's last point, the sklearn equivalent of that:
http://scikit-learn.org/stable/modules/grid_search.html

I agree 100%, it's something I truly miss about sklearn.  I'd support
implementing those 'everyone has one' algos from paragraph 1 if that was
the end goal.

Finally, re data-frames.  Why not leave it as vectors and matrices? That is
a more R-Like thing to do anyway.

val X: Matrix = data
val y: Vector = labels

model1.fit(X, y)

I don't mean to dominate the conversation, and I'm sorry- but I really
wanted to toss that idea re: hierarchical models out there, bc I know lots
of people who would love to have them, and it is the thing keeping them on
single core machines at the moment.

tg


Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*



Re: Traits for a mahout algorithm Library.

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
sk-learn's learner, transformer and predictor features sound good to me:
tried and proven.

Most importantly, IMO, we need a strong, established type system, and should
not repeat what I view as a problem in some other offerings. If the type
system is strict and limited in size, then there's much less need for data
adapters, or none at all.

So, what we have:
-- double-precision tensor types (but not n-d arrays, though)
What we don't have:
-- data frames

What we may want to have:
-- formula support, especially for non-linear GLM ("linear generalized
linear", does this make sense at all?), OK, non-linear regressions.
A formula normally acts on data-frame-y data, not on tensor data, albeit it
produces tensor data. Herein lies a conundrum. I don't see Mahout taking on
data frames; this is just too big. But good formula and "factor" (in the R
sense) support is nice to have for down-to-earth problems.

Perhaps a tactical solution here is to integrate some foreign engine's data
frames but have Mahout-native formula support. But I didn't give it much
thought because, although formulas and step-wise non-linear model searches
are the first thing to happen to any analytics (but somehow it hasn't
happened well enough elsewhere), I don't see how it can be done cheaply in
an engine-agnostic way. I still commonly view Mahout as an under-funded
project, so choices of new things should be smart: small in volume, great
in bang. Dataframes are not small in volume, esp. since I am
increasingly turning away from Spark in my personal endeavors, so I won't
support just integrating sparkql for this purpose.

A big area that people actually need (IMO), and that hasn't been done well
elsewhere (IMO), is model and model parameter search. This "ML optimizer"
idea has been at AMPLab for as long as I remember them, and is still
very popular, but I don't think there are good offerings that actually
solve this problem in OSS. One of the reasons is that modern OSS is pretty
slow for the volume required by the task. If we get some unique improvements
to the framework, we can think of getting into this business. This shouldn't
be that difficult, assuming the throughput is not an issue. GPU clusters are
increasingly common, so we can hope we'll get there in the future.

On the algorithm side, I would love to see something with 2-d inputs, CNNs
or something, for image processing.





Re: Traits for a mahout algorithm Library.

Posted by Trevor Grant <tr...@gmail.com>.
I was thinking so too.  Most ML frameworks are at least loosely based on the
Sklearn paradigm.  For those not familiar, at a very abstract level:

model1 = new Algo // e.g. K-Means, Random Forest, Neural Net

model1.fit(trainingData)

// then, depending on the goal of the algorithm, you have either (or both)
preds = model1.predict(testData)  // which returns a vector of predictions,
                                  // one for each obs point in the testing data

// or sometimes
newVals = model1.transform(testData)  // which returns a new dataset-like
// object, as this makes more sense for things like neural nets, or when
// you're not just predicting a single value per observation


In addition to the above, pre-processing operations then also have a
transform method such as

preprocess1 = new Normalizer

preprocess1.fit(trainingData)  // in this phase, calculates the mean and
                               // variance of the training data set

preprocessedTrainingData = preprocess1.transform(trainingData)
preprocessedTestingData = preprocess1.transform(testingData)
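
The fit/transform split for a pre-processing step can be sketched in runnable form as follows. This `Normalizer` is a hypothetical stand-in written against plain Scala collections (rows as `Array[Double]`), not an existing Mahout class.

```scala
// Hypothetical standardizer: fit() learns per-column mean and standard
// deviation from the training data; transform() reuses them on any data.
// transform() must only be called after fit().
final class Normalizer {
  private var mean = Array.empty[Double]
  private var std = Array.empty[Double]

  def fit(data: Seq[Array[Double]]): Normalizer = {
    require(data.nonEmpty, "cannot fit on empty data")
    val n = data.length
    val d = data.head.length
    mean = Array.tabulate(d)(j => data.map(_(j)).sum / n)
    std = Array.tabulate(d) { j =>
      // Population variance of column j.
      val variance = data.map(r => math.pow(r(j) - mean(j), 2)).sum / n
      // Guard against constant columns so transform never divides by zero.
      if (variance > 0.0) math.sqrt(variance) else 1.0
    }
    this
  }

  def transform(data: Seq[Array[Double]]): Seq[Array[Double]] =
    data.map(r => Array.tabulate(r.length)(j => (r(j) - mean(j)) / std(j)))
}
```

The testing data is scaled with the mean and variance learned from the training data, which is the reason fit and transform are separate calls.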

I think this is a reasonable approach bc A) it makes sense and B) it is a
standard of sorts across ML libraries (bc of A)

We have two high level bucket types, based on what the output is:

Predictors and Transformers

Predictors: anything that returns a single value per observation; these are
classifiers and regressors

Transformers: anything that returns a vector per observation
- Pre-processing operations
- Classifiers, in that usually there is a probability vector for each
observation as to which class it belongs to; the 'predict' method then
just picks the most likely class
- Neural nets (though with one small tweak they can be extended to
regression or classification)
- Any unsupervised learning application (e.g. clustering)
- etc.
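
The relationship between a classifier's transform output and its predict output might be sketched as below; `ArgmaxPredict` is a hypothetical helper, not Mahout API.

```scala
object ArgmaxPredict {
  // `probabilities(i)` is the class-probability vector for observation i,
  // i.e. what a classifier's transform step would return. predict keeps
  // only the index of the most likely class (first index wins on ties).
  // Each probability vector must be non-empty.
  def predict(probabilities: Seq[Array[Double]]): Seq[Int] =
    probabilities.map { p =>
      p.indices.reduceLeft((i, j) => if (p(j) > p(i)) j else i)
    }
}
```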

And so really we have something like:

class LearningFunction
  def fit()

class Transformer extends LearningFunction:
  def transform

class Predictor extends Transformer:
  def predict
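
One possible rendering of that hierarchy as Scala traits, with a toy implementation, is sketched below. The `Matrix` and `Vector` aliases are placeholders for whatever matrix and vector types the library would actually use, and `ThresholdClassifier` is made up purely to show the three methods working together.

```scala
object LearningTraits {
  // Placeholder aliases, standing in for real matrix and vector types.
  type Matrix = Array[Array[Double]]
  type Vector = Array[Double]

  trait LearningFunction {
    def fit(data: Matrix): this.type
  }

  trait Transformer extends LearningFunction {
    def transform(data: Matrix): Matrix
  }

  trait Predictor extends Transformer {
    def predict(data: Matrix): Vector
  }

  // Toy classifier: learns the mean of column 0, scores each row with a
  // two-element vector (at-or-below, above), and predicts the larger entry.
  final class ThresholdClassifier extends Predictor {
    private var threshold = 0.0

    def fit(data: Matrix): this.type = {
      require(data.nonEmpty, "cannot fit on empty data")
      threshold = data.map(_(0)).sum / data.length
      this
    }

    def transform(data: Matrix): Matrix =
      data.map(r => if (r(0) > threshold) Array(0.0, 1.0) else Array(1.0, 0.0))

    def predict(data: Matrix): Vector =
      transform(data).map(s => if (s(1) > s(0)) 1.0 else 0.0)
  }
}
```

Returning `this.type` from `fit` lets calls be chained, as in `new ThresholdClassifier().fit(train).predict(test)`.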


This paradigm also lends itself nicely to pipelines...

pipeline1 = new Pipeline
                   .add( transformer1 )
                   .add( transformer2 )
                   .add( model1 )

pipeline1.fit( trainingData )
pipeline1.predict( testingData )
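
A runnable sketch of the pipeline idea follows. It covers only fit and transform (a predict on the final stage would follow the same pattern), and every name in it (`PipelineSketch`, `Stage`, `Center`, `Scale`) is hypothetical.

```scala
object PipelineSketch {
  type Data = Seq[Array[Double]]

  trait Stage {
    def fit(data: Data): Unit
    def transform(data: Data): Data
  }

  // Immutable builder: add() returns a new pipeline with the stage appended.
  final class Pipeline private (stages: Vector[Stage]) {
    def this() = this(Vector.empty)

    def add(stage: Stage): Pipeline = new Pipeline(stages :+ stage)

    // Fit each stage on the output of the stages before it.
    def fit(data: Data): Pipeline = {
      stages.foldLeft(data) { (current, stage) =>
        stage.fit(current)
        stage.transform(current)
      }
      this
    }

    // Push data through every fitted stage, in order.
    def transform(data: Data): Data =
      stages.foldLeft(data)((current, stage) => stage.transform(current))
  }

  // Toy stage: subtracts the training mean of column 0.
  final class Center extends Stage {
    private var mean = 0.0
    def fit(data: Data): Unit = { mean = data.map(_(0)).sum / data.length }
    def transform(data: Data): Data = data.map(r => Array(r(0) - mean))
  }

  // Toy stateless stage: multiplies column 0 by a constant.
  final class Scale(factor: Double) extends Stage {
    def fit(data: Data): Unit = ()
    def transform(data: Data): Data = data.map(r => Array(r(0) * factor))
  }
}
```

Usage mirrors the pseudocode above: `new Pipeline().add(new Center).add(new Scale(2.0)).fit(train).transform(test)`.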

I have to read up on recommenders a bit more to figure out how those play in,
or if we need another class.

In addition to that, I think we would have an optimizers section that allows
for the various flavors of SGD, but also allows other types of optimizers
altogether.
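
An optimizers section along those lines might start from a shared interface, sketched here with two interchangeable update rules. For brevity both use the full gradient rather than a stochastic estimate, and all names (`Optimizers`, `Gradient`, `GradientDescent`, `Momentum`) are assumptions for illustration.

```scala
object Optimizers {
  // The objective exposes only its gradient to the optimizer.
  type Gradient = Array[Double] => Array[Double]

  trait Optimizer {
    def minimize(grad: Gradient, start: Array[Double], steps: Int): Array[Double]
  }

  // Plain gradient descent with a fixed learning rate.
  final class GradientDescent(rate: Double) extends Optimizer {
    def minimize(grad: Gradient, start: Array[Double], steps: Int): Array[Double] =
      (0 until steps).foldLeft(start) { (w, _) =>
        val g = grad(w)
        Array.tabulate(w.length)(i => w(i) - rate * g(i))
      }
  }

  // Same interface, different update rule: classical momentum.
  final class Momentum(rate: Double, beta: Double) extends Optimizer {
    def minimize(grad: Gradient, start: Array[Double], steps: Int): Array[Double] = {
      var w = start.clone()
      var v = new Array[Double](w.length)
      for (_ <- 0 until steps) {
        val g = grad(w)
        v = Array.tabulate(w.length)(i => beta * v(i) - rate * g(i))
        w = Array.tabulate(w.length)(i => w(i) + v(i))
      }
      w
    }
  }
}
```

Because a learner would depend only on `Optimizer`, swapping one flavor for another would not require changes to the learner itself.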

Again, just moving the conversation forward a bit here.

Excited to get to work on this

Best,

tg









Re: Traits for a mahout algorithm Library.

Posted by Sebastian <ss...@apache.org>.
Hi Andrew,

I think this topic is broader than just defining a few traits. A popular
way of integrating ML algorithms is via the combination of dataframes
and pipelines, similar to what scipy and SparkML are offering at the
moment. Maybe it could make sense to integrate with what they have
instead of starting our own efforts?

Best,
Sebastian


