Posted to common-user@hadoop.apache.org by Bob Futrelle <bo...@gmail.com> on 2007/12/05 04:12:33 UTC

Comparing Hadoop to Apple Xgrid?

We have a small Apple cluster running Hadoop. But another option we
have, built into the Apple Server OS, is to use their Xgrid, which
they promote for "supercomputer" scientific applications.

Any thoughts on the relative merits?  I know that it depends on the
application.  For us, we want to do pattern recognition, turning
raster images into collections of the objects we discover in the
images. Another focus for us is NLP, esp. phrasal analysis.

 - Bob Futrelle

Re: Comparing Hadoop to Apple Xgrid?

Posted by Bob Futrelle <bo...@gmail.com>.
For the record, here's the Apple Xgrid (hype?) page:

http://www.apple.com/server/macosx/technology/xgrid.html

 - Bob

On Dec 5, 2007 5:04 AM, Bob Futrelle <bo...@gmail.com> wrote:
> You've written a spirited statement about the strengths of hadoop.
> But I'd still be interested in hearing from someone who might
> understand why an Xgrid cluster with its attendant management system
> would or would not be equally good for these problems. After all,
> there are a reasonable number of Xgrid customers who are getting their
> work done.
>
> Maybe I'll need to learn more about both and also engage in some
> discussions with the Xgrid community. I do intend to bring up the
> Xgrid system on our cluster to see how it works for us.  That'll
> certainly deepen my understanding of both.
>
> Thanks for the detailed reply.
>
>  - Bob
>
>
>
> On Dec 5, 2007 12:17 AM, Ted Dunning <td...@veoh.com> wrote:
> >
> > If you are looking at large numbers of independent images, then Hadoop should
> > be close to perfect for this analysis (the problem is embarrassingly
> > parallel).  If you are looking at video, then you can still do quite well by
> > building what is essentially a probabilistic list of recognized items in the
> > video stream in the map phase, giving all frames from a single shot the same
> > reduce key.  Then in the reduce phase, you can correlate the possible
> > objects and their probabilities according to object persistence models.  It
> > would be good to do another pass after that to do scene to scene
> > correlations.  This formulation gives you near perfect parallelism as well.
> >
> > For NLP, the problem at the level of phrasal analysis can also be made
> > trivially parallel if you have large numbers of documents.  Again, you may
> > need to do a secondary pass to find duplicated references across multiple
> > documents but this is usually far less intensive than the original analysis.
> >
> > Standard scientific HPC architectures are all about facilitating arbitrary
> > communication patterns and process boundaries.  This is exceedingly hard to
> > do really well and few systems attain really good performance.  Hadoop is
> > all about working with a really simple primitive that is so simple that it
> > can be implemented really well with simple and cheap hardware.  What is
> > surprising (a bit) is that so many problems can be well expressed as
> > map-reduce programs.  Sometimes this is only true at really large scale
> > where correlations become small (allowing the map phase to do useful work on
> > many sub-units), sometimes it requires relatively large intermediate data
> > (such as many graph algorithms).  The fact is, however, that it works
> > remarkably well.
> >
> >
> > On 12/4/07 7:12 PM, "Bob Futrelle" <bo...@gmail.com> wrote:
> >
> > > For us, we want to do pattern recognition, turning
> > > raster images into collections of the objects we discover in the
> > > images. Another focus for us is NLP, esp. phrasal analysis.
> >
> >
>
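
A minimal sketch of the shot-keyed formulation Ted describes above, in plain Python standing in for Hadoop's map, shuffle, and reduce phases. The fake detector output, the shot ids, and the "persistence" rule (keep a label seen in at least two frames of a shot, scored by its mean probability) are all illustrative placeholders, not part of the original post:

```python
from collections import defaultdict

# Map phase: each frame record yields (shot_id, candidate detection).
# A real job would run a recognizer here; the detections are canned.
def map_frame(frame):
    # frame = (shot_id, frame_no, detections), where detections is a
    # list of (label, probability) pairs from the recognizer
    shot_id, frame_no, detections = frame
    for label, p in detections:
        yield shot_id, (label, p)

# Reduce phase: correlate candidates within one shot.  Here the
# "object persistence model" is simply: keep labels seen in at least
# min_count frames, scored by their mean probability.
def reduce_shot(shot_id, values, min_count=2):
    stats = defaultdict(list)
    for label, p in values:
        stats[label].append(p)
    return {
        label: sum(ps) / len(ps)
        for label, ps in stats.items()
        if len(ps) >= min_count
    }

def run_job(frames):
    # Shuffle: group map output by the reduce key (the shot id), so
    # all frames from one shot land in the same reduce call.
    groups = defaultdict(list)
    for frame in frames:
        for key, value in map_frame(frame):
            groups[key].append(value)
    return {k: reduce_shot(k, vs) for k, vs in groups.items()}

frames = [
    ("shot1", 0, [("car", 0.9), ("tree", 0.4)]),
    ("shot1", 1, [("car", 0.8)]),
    ("shot2", 0, [("dog", 0.7)]),
]
# "car" persists across shot1; single-frame candidates drop out
print(run_job(frames))
```

A second map/reduce pass over the per-shot results would then handle the scene-to-scene correlations mentioned above.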

Re: Comparing Hadoop to Apple Xgrid?

Posted by Bob Futrelle <bo...@gmail.com>.
All this feedback is informative and valuable -- Thanks!

 - Bob Futrelle
   Northeastern U.


On Dec 5, 2007 12:55 PM, Ted Dunning <td...@veoh.com> wrote:
>
> I just read the Xgrid page and it is clear that Apple has pushed on the
> following parameters (they may be doing lots of other cool stuff that I
> don't know about):
>
> A) auto-configuration
> B) wider distribution of computation
> C) local checkpointing of processes for restarts
>
> What they have apparently not done includes
>
> X) map/reduce
> Y) magic process restarts in the face of failure (see map/reduce)
> Z) distributed file system
>
> When newbies try to run Hadoop, they ALWAYS seem to run head-long into the
> lack of (A) (how many times has somebody essentially said "I have a totally
> screwed up DNS and hadoop won't run"?).
>
> Item (B) is probably a bad thing for hadoop given the bandwidth required for
> the shuffle phase.
>
> Item (C) is inherent in map-reduce and is pretty neutral either way.
>
>
>
>
>
> On 12/5/07 9:23 AM, "Ted Dunning" <td...@veoh.com> wrote:
>
> >
> >
> > Sorry about not addressing this. (and I appreciate your gentle prod)
> >
> > The Xgrid would likely work well on these problems.  They are, after all,
> > nearly trivial to parallelize because of clean communication patterns.
> >
> > Consider an alternative problem of solving n-body gravitational dynamics for
> > n > 10^6 bodies.  Here there is nearly universal communication.
> >
> > As another example, last week I heard from some Sun engineers that one of
> > their HPC systems had to satisfy a requirement for checkpointing large
> > numerical computations in which a large number of computational nodes were
> > required to dump tens of TB of checkpoint data to disk in less than 10
> > seconds.
> >
> > Finally, many of these HPC systems are designed to fit the entire working
> > set into memory so that high numerical computational throughput can be
> > maintained.  In this regime, communications have to work on memory
> > time-scales rather than disk time-scales.
> >
> > None of these three example problems are very suitable for Hadoop.
> >
> > The sample problems you gave are a different matter.
> >
> >
> > On 12/5/07 2:04 AM, "Bob Futrelle" <bo...@gmail.com> wrote:
> >
> >> why an Xgrid cluster with its attendant management system
> >> would or would not be equally good for these problems
> >
>
>
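
For scale, the checkpoint requirement quoted above implies an enormous aggregate write rate. A back-of-the-envelope calculation, assuming 30 TB dumped in 10 seconds across 1,000 nodes (both figures are illustrative stand-ins for the "tens of TB" and "less than 10 seconds" in the post):

```python
checkpoint_tb = 30   # illustrative: "tens of TB" of checkpoint data
window_s = 10        # "in less than 10 seconds"
nodes = 1000         # illustrative cluster size

aggregate_gb_per_s = checkpoint_tb * 1000 / window_s  # TB -> GB
per_node_gb_per_s = aggregate_gb_per_s / nodes

print(aggregate_gb_per_s)  # 3000.0 GB/s aggregate
print(per_node_gb_per_s)   # 3.0 GB/s sustained per node
```

Three gigabytes per second of sustained writes per node is far beyond what a 2007-era commodity disk could deliver, which is why this workload calls for specialized HPC storage rather than a Hadoop-style cluster.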
