Posted to user@mahout.apache.org by Trevor Grant <tr...@gmail.com> on 2017/06/05 20:09:14 UTC

Re: Samsara's learning curve

Fwiw-

I think I'm about 10 hours into multi-layer perceptrons, with maybe another 2
to go for docs and the last unit tests.  It could have been quicker, but I
already have follow-on things I want to do, and I'm building the code so that
it will be easily extensible (to LSTMs, convolutional nets, etc.).  If I had
taken some shortcuts it probably could have been done in 5-7 hours, and a
large part of that was remembering how back-propagation works and getting
lost in my own indices.
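
For the curious, the heart of back-propagation is just a handful of matrix
products once the shapes are lined up.  Below is a rough, minimal sketch of a
one-hidden-layer network written against Mahout's in-core Scala DSL; the XOR
toy data, layer sizes, seeds, and learning rate are all made up for
illustration, and this is not the code going into the actual implementation:

    import org.apache.mahout.math._
    import org.apache.mahout.math.scalabindings._
    import RLikeOps._

    // toy XOR data: 4 samples, 2 inputs, 3 hidden units, 1 output
    val X = dense((0, 0), (0, 1), (1, 0), (1, 1))               // 4 x 2
    val y = dense((0.0, 1.0, 1.0, 0.0)).t                       // 4 x 1
    val W1 = Matrices.symmetricUniformView(2, 3, 1234).cloned   // 2 x 3
    val W2 = Matrices.symmetricUniformView(3, 1, 4321).cloned   // 3 x 1
    val sigmoid = (v: Double) => 1.0 / (1.0 + math.exp(-v))
    val eta = 0.5                                               // learning rate

    for (epoch <- 1 to 5000) {
      // forward pass (":=" applies a function elementwise, in place)
      val H = X %*% W1                                          // 4 x 3
      H := ((r, c, v) => sigmoid(v))
      val O = H %*% W2                                          // 4 x 1
      O := ((r, c, v) => sigmoid(v))

      // backward pass: the matrix shapes keep the indices honest
      val dSigO = O.cloned                                      // sigma'(O) = O * (1 - O)
      dSigO := ((r, c, v) => v * (1 - v))
      val dSigH = H.cloned
      dSigH := ((r, c, v) => v * (1 - v))
      val dO = (O - y) * dSigO                                  // 4 x 1
      val dH = (dO %*% W2.t) * dSigH                            // 4 x 3
      W2 -= (H.t %*% dO) * eta                                  // 3 x 1
      W1 -= (X.t %*% dH) * eta                                  // 2 x 3
    }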



Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Wed, Mar 29, 2017 at 11:26 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> While I agree with D and T, I’ll add a few things to watch out for.
>
> One of the hardest things to learn is the new model of execution; it's not
> quite Spark or any other compute engine.  You need to create contexts that
> virtualize the actual compute engine, but you will probably need to use the
> actual compute engine too.  Switching back and forth is fairly simple, but
> it has to be learned and could be documented better.
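>
> To make that concrete, here is a rough sketch of what the back-and-forth
> looks like on Spark, assuming roughly the 0.13-era Spark bindings; the
> master URL and app name are just placeholders:
>
>   import org.apache.mahout.math.scalabindings._
>   import org.apache.mahout.math.drm._
>   import org.apache.mahout.math.drm.RLikeDrmOps._
>   import org.apache.mahout.sparkbindings._
>
>   // a Mahout distributed context that virtualizes the Spark engine
>   implicit val ctx = mahoutSparkContext(masterUrl = "local[*]",
>                                         appName = "samsara-demo")
>
>   // engine-agnostic algebra on a distributed row matrix (DRM)
>   val drmA = drmParallelize(dense((1, 2), (3, 4), (5, 6)), numPartitions = 2)
>   val drmAtA = (drmA.t %*% drmA).checkpoint()
>
>   // dropping down to the actual engine: the Spark RDD behind the DRM
>   val rdd = drmAtA.rdd                          // (rowKey, Vector) pairs
>   rdd.map { case (k, row) => k -> row.zSum() }.collect()
>   // ...and drmWrap(someRdd) goes the other way, back into Samsara land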
>
> The other missing bit is dataframes.  R and Spark have them in different
> forms, but Mahout largely ignores the issue of real-world object ids.
> Again, it's not very hard to work around, and here's hoping it's added in a
> future rev.
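>
> The usual workaround is a small id dictionary kept next to the DRM,
> something like the sketch below; the ids and the drmWrap step are made up
> purely for illustration:
>
>   import org.apache.mahout.math._
>   import org.apache.mahout.math.scalabindings._
>
>   // rows arrive keyed by real-world ids, but DRMs want dense Int keys
>   val rows: Seq[(String, Vector)] = Seq(
>     "u-alice" -> dvec(5.0, 0.0, 3.0),
>     "u-bob"   -> dvec(0.0, 4.0, 1.0))
>
>   // dictionary: external id -> Int row key, plus the reverse direction
>   val id2key = rows.map(_._1).zipWithIndex.toMap
>   val key2id = id2key.map(_.swap)
>   val keyed  = rows.map { case (id, v) => id2key(id) -> v }
>
>   // 'keyed' can now be parallelized and wrapped into a DRM (e.g. drmWrap
>   // on Spark); after the algebra, key2id maps result row keys back to the
>   // original ids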
>
>
> On Mar 27, 2017, at 1:38 PM, Trevor Grant <tr...@gmail.com>
> wrote:
>
> I tend to agree with D.
>
> For example, I set out to do the 'Eigenfaces problem' last year and wrote a
> blog on it.  It ended up being about 4 lines of Samsara code (+ imports);
> the "hardest" part was loading images into vectors, and then vectors back
> into images (it wasn't awful, but I was new to Scala).  In addition to the
> modest marketing and the lack of introductory tutorials, there's the fact
> that to really use Mahout-Samsara in the first place you need a fairly good
> grasp of linear algebra, which gives it significantly less mass appeal
> than, say, mllib/sklearn/etc.  Your
> I-just-got-my-data-science-certificate-from-coursera data scientists simply
> aren't equipped to use Mahout.  Your advanced-R-type data scientists can
> use it, but unless they have a problem that is too big for a single
> machine, they have no motivation to use it (that may change with native
> solvers, more algorithms, etc.), and even given motivation the question
> then becomes: learn Mahout, or come up with a clever trick for staying on a
> single machine.
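>
> For a flavor of what those few lines look like, here is a sketch of the
> algebra side, assuming drmFaces is a DRM whose rows are flattened grayscale
> face images; the image loading/reshaping code is the part omitted here, and
> it was the real work:
>
>   import org.apache.mahout.math.drm._
>   import org.apache.mahout.math.drm.RLikeDrmOps._
>   import org.apache.mahout.math.decompositions._
>
>   // distributed stochastic PCA: k principal components of the face matrix
>   val (drmU, drmV, s) = dspca(drmFaces, k = 20, p = 15, q = 1)
>
>   // each of the k columns of V lives in pixel space -- those are the
>   // "eigenfaces"; collect in-core and reshape each column into an image
>   val eigenfaces = drmV.collect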
>
> But yeah, it's a fairly easy and pleasant framework.  If you have the
> proper motivation, there is simply nothing else like it.
>
> tg
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>
>
> On Mon, Mar 27, 2017 at 12:32 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > I believe writing in the DSL is simple enough, especially if you have
> > some familiarity with Scala on top of R (or, in my case, R on top of
> > Scala perhaps :).  I've implemented about a couple dozen customized
> > algorithms that used distributed Samsara algebra at least to some degree,
> > and I think I can reliably attest that none of them ever exceeded 100
> > lines or so, and that it significantly reduced the time I dedicated to
> > writing algebra on top of Spark and some other backends I use under
> > proprietary settings.  I am now mostly doing non-algebraic improvements,
> > because writing algebra is easy.
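> >
> > As a rough illustration of what a typical fragment looks like (assuming
> > row-aligned drmX / drmY DRMs that someone has already loaded), ordinary
> > least squares via the normal equations is just:
> >
> >   import org.apache.mahout.math.scalabindings._
> >   import org.apache.mahout.math.scalabindings.RLikeOps._
> >   import org.apache.mahout.math.drm._
> >   import org.apache.mahout.math.drm.RLikeDrmOps._
> >
> >   val XtX = (drmX.t %*% drmX).collect   // n x n, small enough in-core
> >   val XtY = (drmX.t %*% drmY).collect   // n x 1
> >   val beta = solve(XtX, XtY)            // in-core solve of X'X beta = X'y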
> >
> > The most difficult part, however, at least for me (and as you can see as
> > you go along with the book), was not the peculiarities of the R-like
> > bindings, but the algorithm reformulations.  Traditional "in-memory"
> > algorithms do not work on shared-nothing backends: even though you could
> > program them, they simply will not perform.
> >
> > The main reason some of the traditional algorithms do not work at scale
> > is that they either require random memory access or (more often) are
> > simply super-linear w.r.t. input size, so as one scales infrastructure at
> > linear cost, one still sees a smaller-than-expected increase in
> > performance (if any at all, at some point) per unit of input.
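> >
> > To put rough, purely illustrative numbers on it: with an O(n^2)
> > algorithm, doubling the input quadruples the work, so even after doubling
> > the cluster at roughly double the cost, each unit of input still takes
> > about twice as long to get through as before -- linear spending never
> > catches up with super-linear work.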
> >
> > Hence, some mathematically, or should I say statistically, motivated
> > tricks are usually still required.  As the book describes, linearly or
> > sub-linearly scalable sketches, random projections, dimensionality
> > reductions, etc. are needed to alleviate the scalability issues of the
> > super-linear algorithms.
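> >
> > For instance, the randomized (projection-based) SVD ships with Samsara; a
> > sketch of invoking it on some tall DRM drmA (the k/p/q values here are
> > arbitrary):
> >
> >   import org.apache.mahout.math.drm._
> >   import org.apache.mahout.math.drm.RLikeDrmOps._
> >   import org.apache.mahout.math.decompositions._
> >
> >   // k singular triplets, p oversampling columns, q power iterations;
> >   // the cost stays roughly linear in the number of rows of drmA
> >   val (drmU, drmV, s) = dssvd(drmA, k = 50, p = 15, q = 1)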
> >
> > To your question, I've had a couple of people do some pieces on various
> > projects with Samsara before, but they had me as a coworker.  I am
> > personally not aware of any outside developers beyond the people already
> > on the project @ Apache and my co-workers, although in all honesty I feel
> > that has more to do with the maturity and modest marketing of the public
> > version of Samsara than necessarily with the difficulty of adoption.
> >
> > -d
> >
> >
> >
> > On Sun, Mar 26, 2017 at 9:15 AM, Gustavo Frederico <
> > gustavo.frederico@thinkwrap.com> wrote:
> >
> >> I read Lyubimov's and Palumbo's book on Mahout Samsara up to chapter 4
> >> (Distributed Algebra).  I have some familiarity with R, and I studied
> >> linear algebra and calculus in undergrad.  In my master's I studied
> >> statistical pattern recognition and researched a number of ML algorithms
> >> in my thesis, spending the most time on SVMs.  This is to ask: what is
> >> the learning curve of Samsara?  How complicated is it to work with
> >> distributed algebra to create an algorithm?  Can someone share an
> >> example of how long she/he took to go from algorithm conception to
> >> implementation?
> >>
> >> Thanks
> >>
> >> Gustavo
> >>
> >
>
>