Posted to dev@crunch.apache.org by Josh Wills <jw...@cloudera.com> on 2013/03/13 09:47:04 UTC

Fwd: Discussion Of ML environment/MR, Mahout

There's an interesting discussion going on over on the Mahout mailing list
about the future of the project and whether they should keep developing
algorithms for MapReduce or branch out to creating models on (e.g.) Spark.
I passed along the following thoughts, which I've been playing around with
in my head for a couple of months.

---------- Forwarded message ----------
From: Josh Wills <jw...@cloudera.com>
Date: Tue, Mar 12, 2013 at 3:05 PM
Subject: Re: Discussion Of ML environment/MR, Mahout
To: dev@mahout.apache.org


Hey Mahout Devs,

Sean clued me in to the discussion you all were having about porting some
algorithms over to Spark, and that overlapped with some things I had been
thinking about lately and wanted to get your input on.

(I think I know everyone on this thread either personally or by
reputation-- for those of you I haven't met in person, it's a pleasure to
virtually meet you, and I hope you'll forgive the long-ish intrusion.)

First, I wanted to say that I think that there are lots of problems that
can be handled well in MapReduce (the recent k-means streaming stuff being
a prime example), even if they could run faster under an in-memory model.
The question for me is usually one of resource allocation:
we're usually sharing our Hadoop clusters with lots of other jobs, and
building a model as a MapReduce pipeline is usually a good way to play
nicely w/everyone else and not take up too much memory. Of course, there
are some models where the performance benefit I'll get from fitting the
model quickly in memory is totally worth the resource costs.

Somewhat relatedly, I would really like a seamless way to perform an analysis
on a small-ish (e.g., a few hundred gigs) dataset in memory and then take
the same code that I used to perform that analysis and scale it out to run
over a few terabytes in batch mode.
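
To make that concrete, here's roughly the shape I have in mind, using
Crunch's existing MemPipeline/MRPipeline split. MemPipeline is only a
single-JVM stand-in rather than a distributed in-memory engine, so treat
this as a sketch of the write-once pattern (the paths and CSV layout below
are made up):

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mem.MemPipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.types.writable.Writables;

    public class SameCodeTwoScales {

      // The analysis only sees the Pipeline interface, so it doesn't care
      // whether it's running in a single JVM or as a series of MR jobs.
      static void runAnalysis(Pipeline pipeline, String input, String output) {
        PCollection<String> lines = pipeline.readTextFile(input);
        PCollection<Double> values = lines.parallelDo(
            new DoFn<String, Double>() {
              @Override
              public void process(String line, Emitter<Double> emitter) {
                // made-up record layout: last CSV field is the value we want
                String[] fields = line.split(",");
                emitter.emit(Double.parseDouble(fields[fields.length - 1]));
              }
            }, Writables.doubles());
        pipeline.writeTextFile(values, output);
        pipeline.done();
      }

      public static void main(String[] args) {
        // small-ish data: run the whole thing in one process
        runAnalysis(MemPipeline.getInstance(), "sample.csv", "out-local");
        // the full dataset: the exact same code, planned as MapReduce jobs
        runAnalysis(new MRPipeline(SameCodeTwoScales.class), args[0], args[1]);
      }
    }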

I'm wondering if we could solve both problems by creating a wrapper API
that would look a lot like the Spark RDD API and then create
implementations of that API via Spark/Giraph/Crunch (truly shameless
promotion: http://crunch.apache.org/ ). That way, the same model could be
run in-memory against Spark or in batch via a series of MapReduce jobs (or
a BSP job, or a Tez pipeline, or whatever execution framework is written
next week.) The main virtue of Crunch in this regard is that the data model
is very similar to Spark's (truth be told, I used the Spark API as a
reference when I was originally creating Crunch's Scala API, "Scrunch")--
the whole idea of distributed collections of "things," where things can be
anything you want them to be (e.g., Mahout Vectors).
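
Very roughly, I'm picturing something like this -- every name below is
invented, it's just to make the shape of the wrapper concrete:

    import java.io.Serializable;

    // A tiny distributed-collection interface in the spirit of the RDD API,
    // which a Spark backend, a Crunch/MapReduce backend, or a Giraph/BSP
    // backend could each implement without the model code changing.
    interface Fn<S, T> extends Serializable {
      T apply(S input);
    }

    interface CombineFn<T> extends Serializable {
      T combine(T a, T b);
    }

    interface DCollection<T> {
      <U> DCollection<U> map(Fn<T, U> fn);
      DCollection<T> filter(Fn<T, Boolean> pred);
      T reduce(CombineFn<T> fn);    // e.g., summing Mahout Vectors
      Iterable<T> materialize();    // pull results local when they're small
    }

    interface DContext {
      DCollection<String> textFile(String path);
      // a Spark-backed or Crunch-backed factory would hand back one of these
    }

An iterative fitter written against DCollection<Vector> would then run in
memory when the DContext is Spark-backed, or as a chain of MapReduce jobs
when it's Crunch-backed.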

I have a set of Crunch + Mahout tools I use for my own data preparation
tasks (e.g., things like turning a CSV file of numerical and categorical
variables into a SequenceFile of normalized Mahout Vectors w/indicator
variables before running things like Mahout's SSVD), so I think we could
make this kind of setup work and get the performance benefits of new
frameworks w/o having to leave MR1 behind completely.
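
To give a flavor of what those tools look like, here's a rough sketch of
the CSV-to-Vector step as a Crunch pipeline (the column layout, scaling
constants, and category levels are all made up for illustration):

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.io.To;
    import org.apache.crunch.types.writable.Writables;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    public class CsvToVectors {

      // Hypothetical layout: two numeric columns followed by one categorical
      // column with three known levels, expanded into indicator variables.
      private static final String[] LEVELS = {"red", "green", "blue"};

      public static void main(String[] args) {
        Pipeline pipeline = new MRPipeline(CsvToVectors.class);
        PCollection<String> lines = pipeline.readTextFile(args[0]);

        PCollection<VectorWritable> vectors = lines.parallelDo(
            new DoFn<String, VectorWritable>() {
              @Override
              public void process(String line, Emitter<VectorWritable> emitter) {
                String[] f = line.split(",");
                Vector v = new DenseVector(2 + LEVELS.length);
                // Numeric fields, scaled by made-up constants; real code
                // would compute means/ranges in an earlier pass.
                v.set(0, Double.parseDouble(f[0]) / 100.0);
                v.set(1, Double.parseDouble(f[1]) / 100.0);
                // Indicator (one-hot) encoding of the categorical field.
                for (int i = 0; i < LEVELS.length; i++) {
                  v.set(2 + i, LEVELS[i].equals(f[2]) ? 1.0 : 0.0);
                }
                emitter.emit(new VectorWritable(v));
              }
            }, Writables.writables(VectorWritable.class));

        // SequenceFile of VectorWritables, ready for something like SSVD.
        vectors.write(To.sequenceFile(args[1]));
        pipeline.done();
      }
    }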

I don't have an opinion on the structure of such an effort (via the
incubator or otherwise), but I thought I would throw the idea out there, as
it's something I would definitely like to be involved in.

Best,
Josh
-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>


