Posted to user@mahout.apache.org by Sean Owen <sr...@gmail.com> on 2011/05/15 19:09:31 UTC

Interesting MapReduce variant: MapFreeduce

Hi all, in my travels I've come across a small interesting startup that I
thought might be of interest to the user@ audience. It's MapFreeduce (
http://mapfreeduce.com/), and they're spinning an interesting twist on
MapReduce. They've constructed a simplified MapReduce API, one for which
workers are able to run as Java applets in the browser sandbox.
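MapFreeduce's actual API isn't public here, so purely as an illustration of what a "simplified MapReduce" contract could look like -- no HDFS, no custom InputFormats, just values in and key/value pairs out -- here is a hypothetical sketch (all interface and class names are my own invention, not theirs):

```java
import java.util.*;
import java.util.function.BiConsumer;

// Hypothetical minimal MapReduce contract of the kind a browser-sandboxed
// worker could run: the worker never touches a file system, it just
// receives inputs and emits key/value pairs back to the originating host.
interface Mapper<I, K, V> {
    void map(I input, BiConsumer<K, V> emit);
}

interface Reducer<K, V, O> {
    O reduce(K key, List<V> values);
}

public class MiniMapReduce {

    // Single-JVM driver for the contract above; a real service would fan
    // the map() calls out to applet workers and stream results back.
    public static <I, K, V, O> Map<K, O> run(
            List<I> inputs, Mapper<I, K, V> m, Reducer<K, V, O> r) {
        Map<K, List<V>> groups = new HashMap<>();
        for (I in : inputs) {
            m.map(in, (k, v) ->
                groups.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
        }
        Map<K, O> out = new HashMap<>();
        for (Map.Entry<K, List<V>> e : groups.entrySet()) {
            out.put(e.getKey(), r.reduce(e.getKey(), e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        // Word count, the canonical example.
        List<String> docs = Arrays.asList("a b a", "b c");
        Map<String, Integer> counts = run(docs,
            (String doc, BiConsumer<String, Integer> emit) -> {
                for (String w : doc.split(" ")) emit.accept(w, 1);
            },
            (String word, List<Integer> ones) ->
                ones.stream().mapToInt(Integer::intValue).sum());
        System.out.println(counts); // counts: a=2, b=2, c=1
    }
}
```

Notice what's missing relative to Hadoop: no Combiner, no partitioner, no InputFormat -- which is exactly the question the service poses: how much of that machinery do you actually need?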

Having played with it myself, I can say it's interesting for two reasons.
One, it asks whether a simpler version of MapReduce than what you get in
Hadoop is viable. That is -- it's not Hadoop. Can you do something useful
without, say, direct access to HDFS? Combiners? Custom InputFormats? And
two, since it can fairly automatically turn office PCs with a browser into
safe background MR workers, it might let an organizational skunk-works
build a cheap cluster out of truly unused cycles to do something
interesting.

I managed to reconstruct parts of the recommender pipeline on it without
too much modification. It is possible to 'port' some parts of Mahout, if
not all. MapReduce fans will probably enjoy taking a look at what they can
get away with in a browser sandbox.

From a conversation with their founder I know they'd really like feedback
and testers. Here's their pitch and plea for beta users in their own words.
(I have no affiliation with or interest in the company.)


*"MapFreeduce.com is a Washington DC-based startup making Big Data
accessible to everyone. Our software service enables users to quickly and
easily build a mapreduce cluster from the spare CPU-cycles of available
computers without installing or configuring any software. To add a node to
your MapFreeduce cluster and increase its power, you simply click on a link
from any idle computer. You can scale your cluster to thousands of nodes to
perform computation- and data-intensive tasks such as web indexing, data
mining, business analytics, data warehousing, machine learning, financial
analysis, scientific simulation, and bioinformatics research. MapFreeduce
allows you to focus on crunching your data without having to worry about
either the cost and complexity of setting up a traditional hardware cluster
or the perpetual fees charged per hour and per node by common cloud
providers.

We are looking for individuals that would be interested in joining our free,
private beta test and/or providing feedback to our service."*

Re: Interesting MapReduce variant: MapFreeduce

Posted by Ted Dunning <te...@gmail.com>.
Most of our jobs are I/O bound anyway and it is common for the switch fabric
connecting desktops to be pretty limited.  My guess is that you would get
very limited increase in total computing progress by these means.  There are
a few notable examples like protein folding where the problems require small
input and output and massive compute time, but very few distributed machine
learning algorithms are like that.
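Ted's point is easy to quantify with a back-of-envelope calculation. The numbers below are illustrative assumptions (dataset size, link speed), not measurements of any real cluster:

```java
// Rough estimate of how long it takes just to ship input data over a
// typical office network. All figures are illustrative assumptions.
public class IoBoundEstimate {

    // Seconds to move `gigabytes` of data at `gbitPerSec` line rate.
    static double transferSeconds(double gigabytes, double gbitPerSec) {
        return gigabytes * 8.0 / gbitPerSec; // GB -> Gbit, then divide by rate
    }

    public static void main(String[] args) {
        double datasetGB = 100.0; // assumed input size
        double uplinkGbit = 1.0;  // assumed shared 1 Gbit/s office uplink

        double t = transferSeconds(datasetGB, uplinkGbit);
        System.out.printf("Shipping %.0f GB at %.0f Gbit/s: ~%.0f s (~%.1f min)%n",
                datasetGB, uplinkGbit, t, t / 60.0);
        // ~800 seconds just to move the input once. A job that does only a
        // few CPU-seconds of work per GB is dominated by the network, which
        // is why CPU-heavy, small-I/O workloads (protein folding) fit the
        // volunteer-computing model so much better than most ML jobs.
    }
}
```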

On Sun, May 15, 2011 at 10:30 AM, Jeremy Lewi <je...@lewi.us> wrote:

> If you're running in an applet without HDFS, doesn't that mean you're
> moving both data and computation to the machine, as opposed to moving
> "computation to the data"? Would this be a big issue for Mahout? For
> example, if you're running k-means and 90% of your machines are
> workstations that would otherwise be idle, then wouldn't you need to
> transfer roughly 90% of your dataset to the various clients (each client
> might only receive a small fraction, but 90% needs to be shipped out of
> your central storage)? It seems like network bottlenecks could easily
> swamp the benefits of using workstation cycles.
>

Re: Interesting MapReduce variant: MapFreeduce

Posted by Sean Owen <sr...@gmail.com>.
Yeah, as I understand it, it has to stream data to and from the worker, as
the sandbox allows no access to the file system or network (other than the
originating host). On the plus side -- that limits the damage it can do to
a user's PC.

And yes, this strikes me as one of the key issues with the model. It works
OK for smallish jobs, or those that are more CPU-intensive than
I/O-intensive. Indeed, I think this grew out of a distributed computing
technology built to handle BOINC-style physics simulations.

It's not going to be a good model for a lot of problems, but it's cool
enough to warrant thinking about what it might be good for. If you can
afford long-running jobs that throttle network usage and all that, it could
be a cheap-o way for a small organization to do something interesting with
MapReduce.


On Sun, May 15, 2011 at 6:30 PM, Jeremy Lewi <je...@lewi.us> wrote:

> Thanks for the link Sean.
>
> Whenever we looked into recovering wasted compute cycles (e.g. by letting
> a job scheduler like Sun Grid Engine fire off jobs during downtime), we
> found that the hassle of administering such a heterogeneous environment
> wasn't worth it. Maybe running as an applet under Hadoop, and the implied
> virtual environment, will make that easier.
>
> If you're running in an applet without HDFS, doesn't that mean you're
> moving both data and computation to the machine, as opposed to moving
> "computation to the data"? Would this be a big issue for Mahout? For
> example, if you're running k-means and 90% of your machines are
> workstations that would otherwise be idle, then wouldn't you need to
> transfer roughly 90% of your dataset to the various clients (each client
> might only receive a small fraction, but 90% needs to be shipped out of
> your central storage)? It seems like network bottlenecks could easily
> swamp the benefits of using workstation cycles.
>
> J
>

Re: Interesting MapReduce variant: MapFreeduce

Posted by Jeremy Lewi <je...@lewi.us>.
Thanks for the link Sean.

Whenever we looked into recovering wasted compute cycles (e.g. by letting
a job scheduler like Sun Grid Engine fire off jobs during downtime), we
found that the hassle of administering such a heterogeneous environment
wasn't worth it. Maybe running as an applet under Hadoop, and the implied
virtual environment, will make that easier.

If you're running in an applet without HDFS, doesn't that mean you're
moving both data and computation to the machine, as opposed to moving
"computation to the data"? Would this be a big issue for Mahout? For
example, if you're running k-means and 90% of your machines are
workstations that would otherwise be idle, then wouldn't you need to
transfer roughly 90% of your dataset to the various clients (each client
might only receive a small fraction, but 90% needs to be shipped out of
your central storage)? It seems like network bottlenecks could easily
swamp the benefits of using workstation cycles.
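The concern above gets worse for iterative algorithms: k-means re-reads the full input every iteration, so any fraction of it not held locally by workers is re-shipped every iteration. A purely illustrative calculation (all numbers assumed, not measured):

```java
// Illustrative arithmetic for the k-means data-shipping concern: workers
// with no local copy of the data must re-fetch their share of the input
// on every iteration of the algorithm.
public class KMeansShipping {

    // Total bytes crossing the network: the remotely-held fraction of the
    // dataset is shipped once per iteration.
    static double bytesShipped(double datasetBytes, double remoteFraction,
                               int iterations) {
        return datasetBytes * remoteFraction * iterations;
    }

    public static void main(String[] args) {
        double dataset = 100e9; // assumed 100 GB input
        double remote = 0.9;    // 90% of workers hold no local data
        int iters = 10;         // assumed k-means iteration count

        double shipped = bytesShipped(dataset, remote, iters);
        System.out.printf("~%.0f GB shipped over %d iterations%n",
                shipped / 1e9, iters);
        // 100 GB * 0.9 * 10 = 900 GB on the wire for a 100 GB dataset:
        // the network traffic dwarfs the dataset itself.
    }
}
```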

J
