Posted to common-user@hadoop.apache.org by Jonathan Gray <jl...@streamy.com> on 2009/04/15 17:22:43 UTC

Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

I agree with you, Andy.

This seems to be a great look into what Hadoop MapReduce is not good at.

Over in the HBase world, we constantly deal with comparisons like this to
RDBMSs, trying to determine if one is better than the other.  It's a false
choice and completely depends on the use case.

Hadoop is not suited for random access, joins, or dealing with subsets of
your data; i.e., it is not a relational database!  It's designed to
distribute a full scan of a large dataset, placing tasks on the same nodes
as the data it's processing.  The emphasis is on task scheduling, fault
tolerance, and very large datasets; low latency has not been a priority.
There are no "indexes" to speak of; they're completely orthogonal to what
Hadoop does, so of course there is an enormous disparity in cases where
indexes make sense.  Yes, B-Tree indexes are a wonderful breakthrough in
data technology :)
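
To make the full-scan versus index point concrete, here is a minimal
single-machine sketch in Python.  It is not Hadoop code: the toy records,
the function names, and the sorted-list "index" (a stand-in for a B-Tree)
are all assumptions made up for illustration.  It only counts how many
records each approach has to touch to answer a one-row lookup.

```python
import bisect

# Toy "table" of (user_id, page_rank) records; hypothetical data.
records = [(uid, uid * 7 % 100) for uid in range(1000)]


def full_scan_lookup(rows, wanted_id):
    """Scan-style access: touch every record and keep the matches."""
    touched = 0
    hits = []
    for uid, rank in rows:
        touched += 1
        if uid == wanted_id:
            hits.append((uid, rank))
    return hits, touched


def build_index(rows):
    """Stand-in for a B-Tree: sorted keys plus their row positions."""
    pairs = sorted((uid, pos) for pos, (uid, _) in enumerate(rows))
    keys = [key for key, _ in pairs]
    positions = [pos for _, pos in pairs]
    return keys, positions


def indexed_lookup(rows, index, wanted_id):
    """Binary-search the keys, then fetch only the matching rows."""
    keys, positions = index
    i = bisect.bisect_left(keys, wanted_id)
    hits = []
    fetched = 0
    while i < len(keys) and keys[i] == wanted_id:
        hits.append(rows[positions[i]])
        fetched += 1
        i += 1
    return hits, fetched


index = build_index(records)
scan_hits, scan_touched = full_scan_lookup(records, 42)
index_hits, index_fetched = indexed_lookup(records, index, 42)
```

Both calls return the same row, but the scan touches all 1000 records
while the indexed lookup fetches one.  Spreading the scan over many nodes
cuts the wall-clock time, not the total work, which is the disparity
described above for selective queries.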

In short, I'm using Hadoop (HDFS and MapReduce) for a broad spectrum of
applications including batch log processing, web crawling, and a number of
machine learning and natural language processing jobs... These may not be
tasks that DBMS-X or Vertica would be good at, if they are even capable of
them, but they are all things that I would include under "Large-Scale Data
Analysis".

Would have been really interesting to see how things like Pig, Hive, and
Cascading would stack up against DBMS-X/Vertica for very complex,
multi-join/sort/etc queries, across a broad spectrum of use cases and
dataset/result sizes.

There is a wide variety of solutions to the problems out there.  It's
important to know the strengths and weaknesses of each, so it's a bit
unfortunate that this paper set the stage as it did.

JG

On Wed, April 15, 2009 6:44 am, Andy Liu wrote:
> Not sure if comparing Hadoop to databases is an apples-to-apples
> comparison.  Hadoop is a complete job execution framework, which
> collocates the data with the computation.  I suppose DBMS-X and Vertica do
> that to some extent, by way of SQL, but you're restricted to that.
> If you want to, say, build a distributed web crawler, or a complex data
> processing pipeline, Hadoop will schedule those processes across a cluster
> for you, while Vertica and DBMS-X only deal with the storage of the data.
>
> The choice of experiments seemed skewed towards DBMS-X and Vertica.  I
> think everybody is aware that Map-Reduce is inefficient for handling
> SQL-like queries and joins.
>
> It's also worth noting that I think 4 out of the 7 authors either
> currently work or at one time worked with Vertica (or C-Store, the
> precursor to Vertica).
>
>
> Andy
>
>
> On Tue, Apr 14, 2009 at 10:16 AM, Guilherme Germoglio
> <ge...@gmail.com> wrote:
>
>
>> (Hadoop is used in the benchmarks)
>>
>>
>> http://database.cs.brown.edu/sigmod09/
>>
>>
>> There is currently considerable enthusiasm around the MapReduce
>> (MR) paradigm for large-scale data analysis [17]. Although the
>> basic control flow of this framework has existed in parallel SQL
>> database management systems (DBMS) for over 20 years, some have called
>> MR a dramatically new computing model [8, 17]. In
>> this paper, we describe and compare both paradigms. Furthermore, we
>> evaluate both kinds of systems in terms of performance and development
>> complexity. To this end, we define a benchmark consisting of a
>> collection of tasks that we have run on an open source version of MR as
>> well as on two parallel DBMSs. For each task, we measure each system's
>> performance for various degrees of parallelism on a cluster of 100
>> nodes. Our results reveal some interesting trade-offs. Although the
>> process to load data into and tune the execution of parallel DBMSs took
>> much longer than the MR system, the observed performance of these DBMSs
>> was strikingly better. We speculate about the causes of the dramatic
>> performance difference and consider implementation concepts that future
>> systems should take from both kinds of architectures.
>>
>>
>> --
>> Guilherme
>>
>>
>> msn: guigermoglio@hotmail.com
>> homepage: http://germoglio.googlepages.com
>>
>>
>

Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

Posted by Bradford Stephens <br...@gmail.com>.

There's definitely a false dichotomy to this paper, and I think it's a
tad disingenuous. It's titled "A Comparison of Approaches to Large-Scale
Data Analysis", when it should be titled "A Comparison of
Parallel RDBMSs to MapReduce for RDBMS-specific problems". There's
little surprise that the people who wrote the paper have been
"gunning" for Hadoop for quite a while -- they've written papers
before which describe MR as a "Big Step Backwards". Not to mention the
primary authors are a CTO of Vertica, a parallel DB company, and a
lead tech from Microsoft.

We all know MapReduce is not meant for tasks that are hard to
parallelize or that lean on indexes, like O(1) access to data, table
joins, grepping indexed stuff, etc. MapReduce excels at highly
parallelizable tasks, like keyword and document indexing, web crawling,
gene sequencing, etc.

What would have been *great*, and what I'm working on a whitepaper
for, is a study on what classes of problems are ideal for parallel
RDBMSs, what are ideal for MapReduce, and then performance timing on
those solutions.

The study is about as useful as if I had written "Comparison of
Approaches to Operating System File Allocation Table Management", and
then compared SQL and Ext3.

Yes, I'm in one of *those* moods today :)

Cheers,
Bradford
