Posted to common-user@hadoop.apache.org by Guilherme Germoglio <ge...@gmail.com> on 2009/04/14 16:16:37 UTC

fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

(Hadoop is used in the benchmarks)

http://database.cs.brown.edu/sigmod09/

There is currently considerable enthusiasm around the MapReduce (MR) paradigm
for large-scale data analysis [17]. Although the basic control flow of this
framework has existed in parallel SQL database management systems (DBMS) for
over 20 years, some have called MR a dramatically new computing model [8, 17].
In this paper, we describe and compare both paradigms. Furthermore, we evaluate
both kinds of systems in terms of performance and development complexity. To
this end, we define a benchmark consisting of a collection of tasks that we
have run on an open source version of MR as well as on two parallel DBMSs. For
each task, we measure each system’s performance for various degrees of
parallelism on a cluster of 100 nodes. Our results reveal some interesting
trade-offs. Although the process to load data into and tune the execution of
parallel DBMSs took much longer than the MR system, the observed performance of
these DBMSs was strikingly better. We speculate about the causes of the
dramatic performance difference and consider implementation concepts that
future systems should take from both kinds of architectures.


-- 
Guilherme

msn: guigermoglio@hotmail.com
homepage: http://germoglio.googlepages.com

Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

Posted by Brian Bockelman <bb...@cse.unl.edu>.
On Apr 14, 2009, at 12:47 PM, Guilherme Germoglio wrote:

> Hi Brian,
>
> I'm sorry but it is not my paper. :-) I've posted the link here because
> we're always looking for comparison data -- so, I thought this benchmark
> would be welcome.
>

Ah, sorry, I guess I was being dense when looking at the author list.   
I'm dense a lot.

> Also, I won't attend the conference. However, it would be a good idea
> for someone who will to ask the authors all these questions and comments
> directly and then post their answers here.
>

It would be interesting!

In the particular field I'm working in (HEP), databases have a long,
colorful history of failure, while MapReduce-like approaches (although
described in very, very different terms and with some interesting alternate
optimizations ... perhaps the best way to describe it would be Map-Reduce
on a partially column-oriented, unstructured data store) have survived.

I'm a big fan of "use the right tool for the job".  There are jobs for
Map-Reduce, there are jobs for DBMSs, and (as the authors point out)
there is overlap and possible cross-pollination between the two.  In
the end, if the tools get better, everyone wins.

Brian



Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

Posted by Guilherme Germoglio <ge...@gmail.com>.
Hi Brian,

I'm sorry but it is not my paper. :-) I've posted the link here because
we're always looking for comparison data -- so, I thought this benchmark
would be welcome.

Also, I won't attend the conference. However, it would be a good idea for
someone who will to ask the authors all these questions and comments directly
and then post their answers here.




-- 
Guilherme

msn: guigermoglio@hotmail.com
homepage: http://germoglio.googlepages.com

Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hey Guilherme,

It's good to see comparisons, especially as it helps folks understand  
better which tool is best for their problem.  As you show in your
paper, a MapReduce system is hideously bad at performing tasks that
column-store databases were designed for (selecting a single value  
along an index, joining tables).

Some comments:
1) For some of your graphs, you show Hadoop's numbers in half-grey,  
half-white.  I can't figure out for the life of me what this  
signifies!  What have I overlooked?
2) I see that one of your co-authors is the CEO/inventor of the  
Vertica DB.  Out of curiosity, how did you interact with Vertica  
versus Hadoop versus DBMS-X?  Did you get help tuning the systems from  
the experts?  I.e., if you sat down with a Hadoop expert for a few  
days, I'm certain you could squeeze out more performance, just like  
whenever I sit down with an Oracle DBA for a few hours, my DB queries  
are much faster.  You touch upon the sociological issues (having to  
program your own code versus having to only know SQL, as well as the  
comparative time it took to set up the DB) - I'd like to hear how much  
time you spent "tweaking" and learning the best practices for the  
three, very different approaches.  If you added a 5th test, what's the  
marginal effort required?
3) It would be nice to see how some of your more DB-like tasks perform  
on something like HBase.  That'd be a much more apples-to-apples  
comparison of column-store DBMS versus column-store data system,  
although the HBase work is just now revving up.  I'm a bit uninformed  
in that area, so I don't have a good gut feeling for how that'd do.
4) I think that the UDF aggregation task (calculating the inlink count  
for each document in a sample) is interesting - it's a more Map-Reduce  
oriented task, and it sounds like it was fairly miserable to hack  
around the limitations / bugs in the DBMS.
5) I really think you undervalue the benefits of replication and  
reliability, especially in terms of cost.  As someone who helps with a  
small site (about 300 machines) that range from commodity workers to  
Sun Thumpers, if your site depends on all your storage nodes  
functioning, then your costs go way up.  You can't make cheap hardware  
scale unless your software can account for it.
   - Yes, I realize this is a different approach than you take.  There  
are pros and cons to large expensive hardware versus lots of cheap  
hardware ... the argument has been going on since the dawn of time.   
However, it's a bit unfair to just outright dismiss one approach.  I  
am a bit wary of the claims that your results can scale up to Google/ 
Yahoo scale, but I do agree that there are truly few users that are  
that large!

I love your last paragraph, it's a very good conclusion.  It kind of  
reminds me of the grid computing field which was (is?) completely  
shocked by the emergence of cloud computing.  After you cut through  
the hype surrounding the new fads, you find (a) that there are some  
very good reasons that the fads are popular - they have definite  
strengths that the existing field was missing (or didn't want to hear)  
and (b) there's a lot of common ground and learning that has to be  
done, even to get a good common terminology :)

Enjoy your conference!

Brian

On Apr 14, 2009, at 9:16 AM, Guilherme Germoglio wrote:

> (Hadoop is used in the benchmarks)
>
> http://database.cs.brown.edu/sigmod09/
>


Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

Posted by tim robertson <ti...@gmail.com>.
Thanks for sharing this - I find these comparisons really interesting.
 I have a small comment after skimming this very quickly.
[Please accept my apologies for commenting on such a trivial thing,
but personal experience has shown this really influences performance]

One thing not touched on in the article is the need for developers to
take performance into account when writing MapReduce programs, much
like they do when making sure the DB query optimizer is choosing the
join order sensibly.

For example, in your MR code there are two simple places to improve performance:

  String fields[] = value.toString().split("\\" + BenchmarkBase.VALUE_DELIMITER);

This will create a new String for "\\" + BenchmarkBase.VALUE_DELIMITER,
then compile a Pattern for it, then split the input, and then both the
new String and the Pattern are left for garbage collection.

String splitting like this is far quicker with a precompiled Pattern
that you reuse:
  static Pattern splitter = Pattern.compile("\\" + BenchmarkBase.VALUE_DELIMITER);
  ...
  splitter.split(value.toString());

A simple loop splitting 100,000 records goes from 431 msec to 69 msec on my
2G MacBook Pro.  Now consider what happens when splitting billions of
rows (it only gets worse with a bigger input string).
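
If you want to reproduce the difference yourself, a throwaway benchmark
along these lines is enough -- note that the "|" delimiter and the record
below are made up stand-ins for BenchmarkBase.VALUE_DELIMITER and the real
input, so treat it as a sketch rather than the benchmark's actual code:

  import java.util.regex.Pattern;

  public class SplitBenchmark {
    // "|" stands in for BenchmarkBase.VALUE_DELIMITER (illustration only).
    private static final Pattern SPLITTER = Pattern.compile("\\|");

    public static void main(String[] args) {
      String record = "field1|field2|field3|field4|field5";
      int n = 100000;

      long t0 = System.currentTimeMillis();
      for (int i = 0; i < n; i++) {
        record.split("\\|");        // compiles a fresh Pattern on every call
      }
      long t1 = System.currentTimeMillis();
      for (int i = 0; i < n; i++) {
        SPLITTER.split(record);     // reuses the one precompiled Pattern
      }
      long t2 = System.currentTimeMillis();

      System.out.println("String.split:        " + (t1 - t0) + " msec");
      System.out.println("precompiled Pattern: " + (t2 - t1) + " msec");
    }
  }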

The other gain is reusing objects rather than creating new ones, e.g.:
  key = new Text(key.toString().substring(0, 7));

Unnecessary object creation and garbage collection kill Java
performance in any application.
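
Putting the two together, a map() written against the 0.20 "mapreduce" API
might look roughly like the sketch below. This is not your benchmark code --
the class name, the "|" delimiter and the 7-character prefix key are only
there to mirror the snippets above:

  import java.io.IOException;
  import java.util.regex.Pattern;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class PrefixMapper extends Mapper<LongWritable, Text, Text, Text> {
    // Compiled once, reused for every record.
    private static final Pattern SPLITTER = Pattern.compile("\\|");
    // One output key object reused across map() calls instead of a new Text per record.
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable offset, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = SPLITTER.split(value.toString());
      // set() overwrites the previous contents; the framework serializes the
      // key during context.write(), so reusing the object is safe.
      outKey.set(fields[0].substring(0, Math.min(7, fields[0].length())));
      context.write(outKey, value);
    }
  }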

(I haven't seen it in your code, but another performance gain is to avoid
relying on Exceptions for control flow where if/else clauses perform far quicker.)

These are really trivial things that people often overlook, but when
you are running these operations billions of times it really adds up;
it is analogous to using a BigInteger on an indexed DB column
where a SmallInteger will do.

Again - I apologise for commenting on such a trivial thing (I really
feel stupid explaining how to split a String in Java efficiently on
this mailing list), but it might be worth considering when doing these
kinds of tests - and, like you say, RDBMSs have 20 years of these
performance tweaks.  Of course, the fact that RDBMSs mostly shield
people from these low-level details is a huge benefit and might be
worth mentioning.

Cheers,

Tim

On Tue, Apr 14, 2009 at 4:16 PM, Guilherme Germoglio
<ge...@gmail.com> wrote:
> (Hadoop is used in the benchmarks)
>
> http://database.cs.brown.edu/sigmod09/
>

Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

Posted by Bryan Duxbury <br...@rapleaf.com>.
I thought it was a conspicuous omission not to discuss the cost of the
various approaches. Hadoop is free, though you have to spend
developer time; how much does Vertica cost on 100 nodes?

-Bryan

On Apr 14, 2009, at 7:16 AM, Guilherme Germoglio wrote:

> (Hadoop is used in the benchmarks)
>
> http://database.cs.brown.edu/sigmod09/
>


Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

Posted by Bradford Stephens <br...@gmail.com>.
There's definitely a false dichotomy to this paper, and I think it's a
tad disingenuous. It's titled "A Comparison Of Approaches To Large
Scale Data Analysis", when it should be titled "A Comparison of
Parallel RDBMSs to MapReduce for RDBMS-specific problems". There's
little surprise that the people who wrote the paper have been
"gunning" for Hadoop for quite a while -- they've written papers
before which describe MR as a "Big Step Backwards". Not to mention the
primary authors are a CTO of Vertica, a parallel DB company, and a
lead tech from Microsoft.

We all know MapReduce is not meant for non-parallelizable, non-indexed
tasks like O(1) access to data, table joins, grepping indexed stuff,
etc. MapReduce excels at highly parallelizable tasks, like keyword and
document indexing, web crawling, gene sequencing, etc.

What would have been *great*, and what I'm working on a whitepaper
for, is a study on what classes of problems are ideal for parallel
RDBMSs, which are ideal for MapReduce, and then performance timing on
those solutions.

The study is about as useful as if I had written "Comparison of
Approaches to Operating System File Allocation Table Management", and
then compared SQL and Ext3.

Yes, I'm in one of *those* moods today :)

Cheers,
Bradford


Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

Posted by Jonathan Gray <jl...@streamy.com>.
I agree with you, Andy.

This seems to be a great look into what Hadoop MapReduce is not good at.

Over in the HBase world, we constantly deal with comparisons like this to
RDBMSs, trying to determine if one is better than the other.  It's a false
choice and completely depends on the use case.

Hadoop is not suited for random access, joins, or dealing with subsets of
your data; i.e. it is not a relational database!  It's designed to
distribute a full scan of a large dataset, placing tasks on the same nodes
as the data they're processing.  The emphasis is on task scheduling, fault
tolerance, and very large datasets; low latency has not been a priority.
There are no "indexes" to speak of; indexing is completely orthogonal to
what it does, so of course there is an enormous disparity in the cases where
an index makes sense.  Yes, B-Tree indexes are a wonderful breakthrough in
data technology :)
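
To make that concrete: a typical Hadoop job is little more than "run this
function over every record in these files".  The sketch below (class and
path names are made up, it is not tied to any particular benchmark) filters
log lines containing "ERROR" -- there is no index and no predicate pushdown,
the framework simply splits the input files and schedules a map task near
each block:

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class LogScan {

    // Emits only the lines that mention "ERROR"; every line of every input
    // file is still read -- a full scan, no index involved.
    public static class ErrorFilterMapper
        extends Mapper<LongWritable, Text, LongWritable, Text> {
      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        if (line.toString().contains("ERROR")) {
          context.write(offset, line);
        }
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = new Job(new Configuration(), "error scan");
      job.setJarByClass(LogScan.class);
      job.setMapperClass(ErrorFilterMapper.class);
      job.setNumReduceTasks(0);               // map-only: just filter and write
      job.setOutputKeyClass(LongWritable.class);
      job.setOutputValueClass(Text.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }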

In short, I'm using Hadoop (HDFS and MapReduce) for a broad spectrum of
applications including batch log processing, web crawling, and a number of
machine learning and natural language processing jobs... These may not be
tasks that DBMS-X or Vertica would be good at, if they are even capable of
them, but they are all things that I would include under "Large-Scale Data
Analysis".

It would have been really interesting to see how things like Pig, Hive, and
Cascading would stack up against DBMS-X/Vertica for very complex
multi-join/sort/etc. queries, across a broad spectrum of use cases and
dataset/result sizes.

There are a wide variety of solutions to the problems out there.  It's
important to know the strengths and weaknesses of each, so a bit
unfortunate that this paper set the stage as it did.

JG



Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

Posted by Andy Liu <an...@gmail.com>.
Not sure if comparing Hadoop to databases is an apples to apples
comparison.  Hadoop is a complete job execution framework, which collocates
the data with the computation.  I suppose DBMS-X and Vertica do that to a
certain extent, by way of SQL, but you're restricted to that.  If you want
to, say, build a distributed web crawler, or a complex data processing
pipeline, Hadoop will schedule those processes across a cluster for you,
while Vertica and DBMS-X only deal with the storage of the data.

The choice of experiments seemed skewed towards DBMS-X and Vertica.  I think
everybody is aware that Map-Reduce is inefficient for handling SQL-like
queries and joins.

It's also worth noting that I think 4 out of the 7 authors either currently
work or at one time worked with Vertica (or c-store, the precursor to Vertica).

Andy

On Tue, Apr 14, 2009 at 10:16 AM, Guilherme Germoglio
<ge...@gmail.com>wrote:

> (Hadoop is used in the benchmarks)
>
> http://database.cs.brown.edu/sigmod09/
>

Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

Posted by Steve Loughran <st...@apache.org>.
Andrew Newman wrote:
> They are comparing an indexed system with one that isn't.  Why is
> Hadoop faster at loading than the others?  Surely no one would be
> surprised that it would be slower - I'm surprised at how well Hadoop
> does.  Who wants to write a paper for next year, "grep vs reverse
> index"?
> 
> 2009/4/15 Guilherme Germoglio <ge...@gmail.com>:
>> (Hadoop is used in the benchmarks)
>>
>> http://database.cs.brown.edu/sigmod09/
>>

I think it is interesting, though it misses the point that the reason
that few datasets are >1PB today is that nobody could afford to store or
process the data. With Hadoop the cost is somewhat high (learn to patch the
source to fix your cluster's problems) but it scales well with the # of
nodes.  Commodity storage costs (my own home now has >2TB of storage)
and commodity software costs are compatible.

Some other things to look at:

- power efficiency. I actually think the DBs could come out better
- ease of writing applications by skilled developers. Pig vs SQL
- performance under different workloads (take a set of log files growing
continually, mine it in near-real time. I think the last.fm use case
would be a good one)


One of the great ironies of SQL is that most developers don't go near it, as
it is a detail handled by the O/R mapping engine, except when building
SQL selects for web pages. If Pig makes M/R easy, would it be used -- and
if so, does that show that we developers prefer procedural thinking?

-steve




Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

Posted by Tarandeep Singh <ta...@gmail.com>.
I think there is one important comparison missing in the paper: cost. The
paper does mention in the conclusion that Hadoop comes for free, but it
doesn't give any details of how much it would cost to license Vertica or
DBMS-X to run on 100 nodes.

Further, with data warehouse products like Hive and CloudBase built on top of
Hadoop, one can use SQL to query the data while using Hadoop underneath.

Thanks for sharing the link,
Tarandeep

On Tue, Apr 14, 2009 at 1:39 PM, Andrew Newman <an...@gmail.com>wrote:

> They are comparing an indexed system with one that isn't.  Why is
> Hadoop faster at loading than the others?  Surely no one would be
> surprised that it would be slower - I'm surprised at how well Hadoop
> does.  Who wants to write a paper for next year, "grep vs reverse
> index"?
>
> 2009/4/15 Guilherme Germoglio <ge...@gmail.com>:
> > (Hadoop is used in the benchmarks)
> >
> > http://database.cs.brown.edu/sigmod09/
> >
>

Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

Posted by Andrew Newman <an...@gmail.com>.
They are comparing an indexed system with one that isn't.  Why is
Hadoop faster at loading than the others?  Surely no one would be
surprised that it would be slower - I'm surprised at how well Hadoop
does.  Who wants to write a paper for next year, "grep vs reverse
index"?

2009/4/15 Guilherme Germoglio <ge...@gmail.com>:
> (Hadoop is used in the benchmarks)
>
> http://database.cs.brown.edu/sigmod09/
>