Posted to dev@spark.apache.org by Robert Dodier <ro...@gmail.com> on 2015/03/04 02:51:25 UTC

ideas for MLlib development

Hi,

I have some ideas for MLlib that I think might be of general interest
so I'd like to see what people think and maybe find some collaborators.

(1) Some form of Markov chain Monte Carlo (MCMC), such as Gibbs
sampling or Metropolis-Hastings. Monte Carlo methods are readily
parallelized, so Spark seems like a natural platform for them.
MCMC plays an important role in computational implementations
of Bayesian inference.
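
For illustration, a single random-walk Metropolis-Hastings chain
might look like the sketch below (plain Python, no Spark
dependency; the function name is hypothetical). Since chains are
independent, in Spark one could distribute them as tasks, e.g.
with something like sc.parallelize(seeds).map(run_chain).

```python
import math
import random

def run_chain(seed, n_samples=5000, burn_in=1000, step=1.0):
    """Sample from a standard normal via random-walk
    Metropolis-Hastings (illustrative sketch)."""
    rng = random.Random(seed)
    log_target = lambda x: -0.5 * x * x  # unnormalized log N(0, 1)
    x = 0.0
    samples = []
    for i in range(n_samples + burn_in):
        proposal = x + rng.gauss(0.0, step)
        # Accept with probability min(1, target(proposal)/target(x)).
        log_alpha = log_target(proposal) - log_target(x)
        if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
            x = proposal
        if i >= burn_in:
            samples.append(x)
    return samples

# Several independent chains; in Spark these would run in parallel.
chains = [run_chain(seed) for seed in range(4)]
pooled = [x for chain in chains for x in chain]
mean = sum(pooled) / len(pooled)
var = sum((x - mean) ** 2 for x in pooled) / len(pooled)
print(mean, var)  # should be near 0 and 1
```

Pooling draws from independent chains is the embarrassingly
parallel part; the harder design questions (sharing state for
Gibbs updates, convergence diagnostics across workers) are what
a framework would need to address.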

(2) A function to compute the calibration of a probabilistic
classifier. The question this answers is: if the classifier
outputs 0.x for some group of examples, is the actual proportion
of positives in that group approximately 0.x? This is useful to
know if the classifier outputs are used to compute expected loss
in some decision procedure.
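
A sketch of what (2) could compute: bin the predictions and
compare each bin's mean predicted probability with its observed
fraction of positives (plain Python; the helper name is
hypothetical).

```python
import random

def calibration_table(probs, labels, n_bins=10):
    """For each probability bin, return (mean predicted
    probability, observed fraction of positives, count).
    A well-calibrated classifier has the first two roughly
    equal in every bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    table = []
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        table.append((mean_pred, frac_pos, len(members)))
    return table

# Toy example: a perfectly calibrated "classifier" whose output p
# is the true probability of the positive label.
rng = random.Random(0)
probs = [rng.random() for _ in range(20000)]
labels = [1 if rng.random() < p else 0 for p in probs]
for mean_pred, frac_pos, n in calibration_table(probs, labels):
    print(f"{mean_pred:.2f}  {frac_pos:.2f}  n={n}")
```

The per-bin counts and positive counts are simple sums, so this
reduces naturally to an aggregate over a distributed dataset.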

Of course (1) is much bigger than (2). Perhaps (2) is a one-person
job but (1) will take a lot of teamwork. I am thinking that in the short
term, we could at least make some progress on an outline or
framework for (1).

I am a newcomer to Scala and Spark but I have a lot of experience
in statistical computing. I am thinking that maybe one or the other
of these projects will be a good way for me to learn more about
Spark and make a useful contribution. Thanks for your interest
and I look forward to your comments.

Robert Dodier

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: ideas for MLlib development

Posted by Robert Dodier <ro...@gmail.com>.
Thanks for your reply, Evan.

> It may make sense to have a more general Gibbs sampling
> framework, but it might be good to have a few desired applications
> in mind (e.g. higher level models that rely on Gibbs) to help API
> design, parallelization strategy, etc.

I think I'm more interested in a general framework which could
be applied to a variety of models, as opposed to an implementation
tailored to a specific model such as LDA. I'm thinking that such
a framework could be used in model exploration, either as an
end in itself or perhaps to identify promising models that could
then be given optimized, custom implementations. This would
be very much in the spirit of existing packages such as BUGS.
In fact, if we were to go down this road, I would propose that
models be specified in the BUGS modeling language -- no need
to reinvent that wheel, I would say.

At a very high level, the API for this framework would specify
methods to compute conditional distributions, marginalizing
as necessary via MCMC. Other operations could include
computing the expected value of a variable or function.
All this is very reminiscent of BUGS, of course.
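
To make that concrete, here is a minimal Python sketch of such a
framework (all names here are hypothetical): a model maps each
variable to a function that draws from its full conditional given
the current state, and expected values are estimated by averaging
over the Gibbs trajectory.

```python
import math
import random

def gibbs(conditionals, init, n_samples=5000, burn_in=500, seed=0):
    """Run a Gibbs sampler: each sweep resamples every variable
    from its full conditional given the current state."""
    rng = random.Random(seed)
    state = dict(init)
    trajectory = []
    for i in range(n_samples + burn_in):
        for name, draw in conditionals.items():
            state[name] = draw(state, rng)  # resample one variable
        if i >= burn_in:
            trajectory.append(dict(state))
    return trajectory

def expectation(trajectory, f):
    """Monte Carlo estimate of E[f(state)]."""
    return sum(f(s) for s in trajectory) / len(trajectory)

# Example model: bivariate normal with correlation rho, whose
# full conditionals are themselves normal.
rho = 0.8
sd = math.sqrt(1 - rho * rho)
model = {
    "x": lambda s, rng: rng.gauss(rho * s["y"], sd),
    "y": lambda s, rng: rng.gauss(rho * s["x"], sd),
}
traj = gibbs(model, {"x": 0.0, "y": 0.0})
print(expectation(traj, lambda s: s["x"] * s["y"]))  # close to rho
```

A BUGS-style front end would compile a model description down to
exactly this kind of map from variables to conditional samplers.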

best,

Robert Dodier



Re: ideas for MLlib development

Posted by "Evan R. Sparks" <ev...@gmail.com>.
Hi Robert,

There's some work to do LDA via Gibbs sampling in this JIRA:
https://issues.apache.org/jira/browse/SPARK-1405 as well as this one:
https://issues.apache.org/jira/browse/SPARK-5556

It may make sense to have a more general Gibbs sampling framework, but it
might be good to have a few desired applications in mind (e.g. higher level
models that rely on Gibbs) to help API design, parallelization strategy,
etc.

See the guide (
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-ContributingNewAlgorithmstoMLLib)
for information about contributing to MLlib.

- Evan
