Posted to common-user@hadoop.apache.org by CubicDesign <cu...@gmail.com> on 2009/09/14 23:03:55 UTC

HadoopDB and similar stuff

Hi.

Does anybody have experience with a DB that can handle large amounts of data on top 
of Hadoop?
HBase and Hive are nice, but they also lack some features. HadoopDB 
seems to strike a balance. However, it still seems to be an infant 
project.

Any thoughts?

Re: HadoopDB and similar stuff

Posted by Ted Dunning <te...@gmail.com>.

Great description.

On Tue, Sep 15, 2009 at 12:01 PM, Edward Capriolo <ed...@gmail.com>wrote:

>
> I often describe Hive query language in this way:
>
> "If you know SQL I can teach you Hive-QL rather quickly."




-- 
Ted Dunning, CTO
DeepDyve

Re: HadoopDB and similar stuff

Posted by Edward Capriolo <ed...@gmail.com>.

On Tue, Sep 15, 2009 at 12:23 PM, Ted Dunning <te...@gmail.com> wrote:
> I don't need to be amazed.  I am a strong proponent of map-reduce.  People
> forget, but I bought the beer at the first Hadoop summit at Gordon Biersch.
>
> I just don't think that selling Hive or Pig as SQL is fair to the buyer or
> the seller.  They aren't the same and have very different virtues.
>
> On Tue, Sep 15, 2009 at 8:34 AM, Edward Capriolo <ed...@gmail.com>wrote:
>
>> Hive is not 100% SQL, but I would say join the Hive user list and be
>> amazed. New types of joins, theta-join, etc have been added by user
>> request. Most of the time if you can't do something you would expect
>> to do in SQL there is a work around.
>>
>> The flip side is true as well, Hive has specific support that other
>> databases don't :)
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Ted,

I meant that I am more amazed by it, not that you should be amazed by
it :) There have been several tickets opened like "Let Hive do
theta join" and then sometimes, within days, Hive trunk supports it.
That is pretty impressive to me.
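
For readers unfamiliar with the term: a theta join is a join whose condition is an arbitrary comparison (such as >= or <) rather than plain equality. The following is a minimal sketch using Python's built-in sqlite3 module. The tables `events` and `windows` and their contents are hypothetical, and the example only illustrates the concept in a conventional SQL engine; it says nothing about how Hive implements or spells such a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events  (id INTEGER, ts INTEGER);
    CREATE TABLE windows (name TEXT, start_ts INTEGER, end_ts INTEGER);
    INSERT INTO events  VALUES (1, 5), (2, 15), (3, 25);
    INSERT INTO windows VALUES ('early', 0, 10), ('late', 10, 30);
""")

# The join condition is a range comparison, not an equality test.
rows = conn.execute("""
    SELECT e.id, w.name
    FROM events e
    JOIN windows w ON e.ts >= w.start_ts AND e.ts < w.end_ts
    ORDER BY e.id
""").fetchall()

print(rows)
```

Each event is matched to the window whose half-open range contains its timestamp, which an equality-only join could not express directly.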

As for
>>I just don't think that selling Hive or Pig as SQL is fair to the buyer or
>>the seller.  They aren't the same and have very different virtues.

I agree. I think I qualified that as my opinion. However, I will say
that I believe I have received some unsolicited from Aster. Likewise,
I notice Aster will reply to blogs about hadoop and plug away, without
really referencing the topic of the blog in any specific way. So, I
would argue the precedent is set.  Maybe it is just my pet peeve.

The philosophical "What is SQL?" is hard to answer. Different
SQL standards like http://en.wikipedia.org/wiki/SQL-92 or SQL-89 may
only be partially supported by a particular vendor. Every
implementation adds/subtracts features.

I often describe Hive query language in this way:

"If you know SQL I can teach you Hive-QL rather quickly."

Re: HadoopDB and similar stuff

Posted by Ted Dunning <te...@gmail.com>.

I don't need to be amazed.  I am a strong proponent of map-reduce.  People
forget, but I bought the beer at the first Hadoop summit at Gordon Biersch.

I just don't think that selling Hive or Pig as SQL is fair to the buyer or
the seller.  They aren't the same and have very different virtues.

On Tue, Sep 15, 2009 at 8:34 AM, Edward Capriolo <ed...@gmail.com>wrote:

> Hive is not 100% SQL, but I would say join the Hive user list and be
> amazed. New types of joins, theta-join, etc have been added by user
> request. Most of the time if you can't do something you would expect
> to do in SQL there is a work around.
>
> The flip side is true as well, Hive has specific support that other
> databases don't :)
>

-- 
Ted Dunning, CTO
DeepDyve

Re: HadoopDB and similar stuff

Posted by Edward Capriolo <ed...@gmail.com>.

On Tue, Sep 15, 2009 at 10:28 AM, Jeff Hammerbacher <ha...@cloudera.com> wrote:
> Hey Ted,
> I don't want to derail this thread, but I would like to correct any
> misperceptions which may exist in the community.
>
> 1) HiveQL intends to include SQL as a subset of its syntax: see the VLDB
> paper for more (
> http://www.slideshare.net/namit_jain/hive-demo-paper-at-vldb-2009). As it
> stands today, a reasonable subset of SQL is already supported, and most
> users of MySQL, Oracle, or PostgreSQL will be able to work comfortably in
> Hive today.
>
> 2) There's a patch for SQL support in Pig:
> http://issues.apache.org/jira/browse/PIG-824.
>
> Every database implements a different dialect of SQL (e.g. express a Top K
> query in your favorite database and compare to the rest), and the Pig and
> HiveQL dialects are as valid as any other. If you disagree, I'd love to hear
> your perspective on why these languages are "not SQL".
>
> Regards,
> Jeff
>
> On Tue, Sep 15, 2009 at 12:42 AM, Ted Dunning <te...@gmail.com> wrote:
>
>> uhhh... neither pig nor hive are really SQL.  Higher level of abstraction
>> than pure MR, but not SQL.
>>
>> You are right to include Greenplum, though.  They slipped my mind, probably
>> because they don't have a google ad running everything 30 seconds like
>> Aster
>> does.
>>
>> On Mon, Sep 14, 2009 at 11:21 PM, Jeff Hammerbacher <hammer@cloudera.com
>> >wrote:
>>
>> > >
>> > > Do you want to tightly integrate SQL and map-reduce?  Asterdata has a
>> > > product that might help you.
>> > >
>> >
>> > As does Greenplum. You could also get this functionality from Pig or
>> Hive,
>> > which are Apache 2.0-licensed subprojects of Hadoop.
>>
>>
>>
>>
>> --
>> Ted Dunning, CTO
>> DeepDyve
>>
>

I notice we have mentioned greenplum and aster. I will speak to the
fact that I have never used either product, but I have spoken to some
sales reps over the years who are very helpful, I might add.

* caveat: I am not saying that my price information is accurate or current

But the major deal breaker at my old places of employment was always
cost. Per-TB pricing was a major deal breaker for us. We wanted to
keep our data indefinitely, but most reporting is month-over-month. So
having to keep all our data (that we don't really use that much after
two months) in a system that charges by the TB was expensive and would
become more expensive as our data set grew.

In the solution space you get a lot of bang for your buck with
(Hadoop+Hive) vs (Teradata, Greenplum, Aster), as you know the price
of Hadoop+Hive is (0+0) plus hardware.

Hive is not 100% SQL, but I would say join the Hive user list and be
amazed. New types of joins, theta-join, etc. have been added by user
request. Most of the time, if you can't do something you would expect
to do in SQL, there is a workaround.

The flip side is true as well: Hive has specific support that other
databases don't :)

Re: HadoopDB and similar stuff

Posted by Amr Awadallah <aa...@cloudera.com>.

Ted,

  Just out of curiosity, have you used Aster Data or Greenplum before? Is 
their SQL 100% compliant with SQL92? (not to mention SQL2008)

-- amr

Ted Dunning wrote:
> On Tue, Sep 15, 2009 at 7:28 AM, Jeff Hammerbacher <ha...@cloudera.com>wrote:
>
>   
>> ... I would like to correct any
>> misperceptions which may exist in the community.
>>
>> 1) HiveQL intends to include SQL as a subset of its syntax: see the VLDB
>> paper for more (
>> http://www.slideshare.net/namit_jain/hive-demo-paper-at-vldb-2009). As it
>> stands today, a reasonable subset of SQL is already supported, and most
>> users of MySQL, Oracle, or PostgreSQL will be able to work comfortably in
>> Hive today.
>>
>>     
>
> Note the key word "intends".  That indicates future tense.
>
> As you say, it is a reasonable subset.  I am sure that
> there are wide swaths of SQL semantics that are not implemented.
> Transactions, rollback, fancy outer joins, exactly correct syntax for null,
> row updates and deletions are areas that I would expect deficiencies
> relative to SQL.  Conversely, I doubt that there are many Hive programs that
> could run without major alterations on conventional SQL engines.
>
> The result is that HiveQL != SQL.  It is more correct to say HiveQL =kinda=
> SQL.
>
>> 2) There's a patch for SQL support in Pig:
>> http://issues.apache.org/jira/browse/PIG-824.
>
> More future tense.  This is hardly part of Pig at this point.  I expect that
> this will come closer to SQL than the current HiveQL, but it is likely to
> not have key semantic properties due to the properties of the substrate and
> also have some important additions.
>
>> Every database implements a different dialect of SQL (e.g. express a Top K
>> query in your favorite database and compare to the rest), and the Pig and
>> HiveQL dialects are as valid as any other.
>
>
> This level of cultural relativism is a bit disingenuous.  My point is that
> you are setting up unreasonable expectations.  MR based systems are
> inherently very different from traditional databases (which is, of course,
> the POINT of having MR).  SQL is very strongly tied to the underlying row
> update and transactional semantics of traditional databases.
>
> I am NOT saying that Hive and Pig are not useful.  For many things, I prefer
> them to SQL-based systems.  I am just saying that they are different
> animals.
>
> I am also NOT saying that Hive and Pig aren't a good way for SQL based
> programmers to transition to map-reduce.  I am just saying that you should
> tell people that Hive and Pig are similar to SQL so you don't have their
> heads explode when they realize that it isn't really SQL.
>
> Remember that many, many people claimed that MyISAM tables are not really
> SQL.  Hive is a darned sight further from SQL than that.
>
>

Re: HadoopDB and similar stuff

Posted by CubicDesign <cu...@gmail.com>.


Ted Dunning wrote:

> I am just saying that you should tell people that Hive and Pig are similar to SQL so you don't have their heads explode when they realize that it isn't really SQL.


Probably for SOME people who are starting a new project (like me) this 
is less relevant. I don't need to convert existing code or frameworks to 
make them work with Hive/Pig/HBase. I need to build a new one from scratch.

Re: HadoopDB and similar stuff

Posted by Ted Dunning <te...@gmail.com>.

On Tue, Sep 15, 2009 at 7:28 AM, Jeff Hammerbacher <ha...@cloudera.com>wrote:

> ... I would like to correct any
> misperceptions which may exist in the community.
>
> 1) HiveQL intends to include SQL as a subset of its syntax: see the VLDB
> paper for more (
> http://www.slideshare.net/namit_jain/hive-demo-paper-at-vldb-2009). As it
> stands today, a reasonable subset of SQL is already supported, and most
> users of MySQL, Oracle, or PostgreSQL will be able to work comfortably in
> Hive today.
>

Note the key word "intends".  That indicates future tense.

As you say, it is a reasonable subset.  I am sure that
there are wide swaths of SQL semantics that are not implemented.
Transactions, rollback, fancy outer joins, exactly correct syntax for null,
row updates and deletions are areas that I would expect deficiencies
relative to SQL.  Conversely, I doubt that there are many Hive programs that
could run without major alterations on conventional SQL engines.

The result is that HiveQL != SQL.  It is more correct to say HiveQL =kinda=
SQL.

> 2) There's a patch for SQL support in Pig:
> http://issues.apache.org/jira/browse/PIG-824.
>

More future tense.  This is hardly part of Pig at this point.  I expect that
this will come closer to SQL than the current HiveQL, but it is likely to
not have key semantic properties due to the properties of the substrate and
also have some important additions.

> Every database implements a different dialect of SQL (e.g. express a Top K
> query in your favorite database and compare to the rest), and the Pig and
> HiveQL dialects are as valid as any other.

This level of cultural relativism is a bit disingenuous.  My point is that
you are setting up unreasonable expectations.  MR based systems are
inherently very different from traditional databases (which is, of course,
the POINT of having MR).  SQL is very strongly tied to the underlying row
update and transactional semantics of traditional databases.

I am NOT saying that Hive and Pig are not useful.  For many things, I prefer
them to SQL-based systems.  I am just saying that they are different
animals.

I am also NOT saying that Hive and Pig aren't a good way for SQL based
programmers to transition to map-reduce.  I am just saying that you should
tell people that Hive and Pig are similar to SQL so you don't have their
heads explode when they realize that it isn't really SQL.

Remember that many, many people claimed that MyISAM tables are not really
SQL.  Hive is a darned sight further from SQL than that.
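
To make the "row update and transactional semantics" point concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a traditional database. The `accounts` table and its values are hypothetical, and nothing in the example describes Hive's behavior; it only shows the kind of in-place update and rollback that conventional SQL users tend to take for granted.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# BEGIN and ROLLBACK statements below are issued explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100)")

conn.execute("BEGIN")
# Update a single row in place, inside the open transaction.
conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
in_txn = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]

# Undo the update; the row returns to its previous value.
conn.execute("ROLLBACK")
after = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]

print(in_txn, after)
```

A system built around batch jobs over immutable files has no obvious equivalent of the `UPDATE` and `ROLLBACK` steps, which is the expectation gap being described.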

Re: HadoopDB and similar stuff

Posted by Jeff Hammerbacher <ha...@cloudera.com>.

Hey Ted,
I don't want to derail this thread, but I would like to correct any
misperceptions which may exist in the community.

1) HiveQL intends to include SQL as a subset of its syntax: see the VLDB
paper for more (
http://www.slideshare.net/namit_jain/hive-demo-paper-at-vldb-2009). As it
stands today, a reasonable subset of SQL is already supported, and most
users of MySQL, Oracle, or PostgreSQL will be able to work comfortably in
Hive today.

2) There's a patch for SQL support in Pig:
http://issues.apache.org/jira/browse/PIG-824.

Every database implements a different dialect of SQL (e.g. express a Top K
query in your favorite database and compare to the rest), and the Pig and
HiveQL dialects are as valid as any other. If you disagree, I'd love to hear
your perspective on why these languages are "not SQL".

Regards,
Jeff
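
As an illustration of the Top K remark, the sketch below lists how the same "top 2 rows by score" query is commonly spelled in a few dialects, and runs the `LIMIT` form against SQLite through Python's built-in sqlite3 module. The table `t` and its data are hypothetical, and the other variants are shown as strings only; they are not executed here.

```python
import sqlite3

# The same Top-K request as commonly written in several dialects.
TOP_K_DIALECTS = {
    "MySQL / PostgreSQL / SQLite":
        "SELECT name, score FROM t ORDER BY score DESC LIMIT 2",
    "SQL Server":
        "SELECT TOP 2 name, score FROM t ORDER BY score DESC",
    "Oracle (ROWNUM)":
        "SELECT * FROM (SELECT name, score FROM t ORDER BY score DESC) "
        "WHERE ROWNUM <= 2",
    "SQL:2008":
        "SELECT name, score FROM t ORDER BY score DESC FETCH FIRST 2 ROWS ONLY",
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT, score INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [("a", 3), ("b", 9), ("c", 7)])

# Only the LIMIT spelling is valid in SQLite, so that is the one we run.
top2 = conn.execute(TOP_K_DIALECTS["MySQL / PostgreSQL / SQLite"]).fetchall()
print(top2)
```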

On Tue, Sep 15, 2009 at 12:42 AM, Ted Dunning <te...@gmail.com> wrote:

> uhhh... neither pig nor hive are really SQL.  Higher level of abstraction
> than pure MR, but not SQL.
>
> You are right to include Greenplum, though.  They slipped my mind, probably
> because they don't have a google ad running everything 30 seconds like
> Aster
> does.
>
> On Mon, Sep 14, 2009 at 11:21 PM, Jeff Hammerbacher <hammer@cloudera.com
> >wrote:
>
> > >
> > > Do you want to tightly integrate SQL and map-reduce?  Asterdata has a
> > > product that might help you.
> > >
> >
> > As does Greenplum. You could also get this functionality from Pig or
> Hive,
> > which are Apache 2.0-licensed subprojects of Hadoop.
>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: HadoopDB and similar stuff

Posted by Ted Dunning <te...@gmail.com>.

uhhh... neither Pig nor Hive is really SQL.  Higher level of abstraction
than pure MR, but not SQL.

You are right to include Greenplum, though.  They slipped my mind, probably
because they don't have a Google ad running every 30 seconds like Aster
does.

On Mon, Sep 14, 2009 at 11:21 PM, Jeff Hammerbacher <ha...@cloudera.com>wrote:

> >
> > Do you want to tightly integrate SQL and map-reduce?  Asterdata has a
> > product that might help you.
> >
>
> As does Greenplum. You could also get this functionality from Pig or Hive,
> which are Apache 2.0-licensed subprojects of Hadoop.

-- 
Ted Dunning, CTO
DeepDyve

Re: HadoopDB and similar stuff

Posted by Jeff Hammerbacher <ha...@cloudera.com>.

>
> Do you want to tightly integrate SQL and map-reduce?  Asterdata has a
> product that might help you.
>

As does Greenplum. You could also get this functionality from Pig or Hive,
which are Apache 2.0-licensed subprojects of Hadoop.

Re: HadoopDB and similar stuff

Posted by Ted Dunning <te...@gmail.com>.

You don't really say what you want here.

Do you want a database that lives in Hadoop's file storage?  HBase is the
closest for that.

Do you want to be able to import or export data between Hadoop and a database?
There is a db input/output format (or three) that could help with that.
Cloudera has their Sqoop software as well.

Do you want to tightly integrate SQL and map-reduce?  Asterdata has a
product that might help you.

Did you mean something else entirely?  Ask again with more details about
what you really want and I am sure somebody will help you out.

On Mon, Sep 14, 2009 at 2:03 PM, CubicDesign <cu...@gmail.com> wrote:

> Hi.
>
> Anybody has experience a DB that can handle large amounts of data on top of
> Hadoop?
> HBase and Hive is nice but they also lack of some features. HadoopDB seems
> to bring some equilibrium. However, it seems to be still an infant project.
>
> Any thoughts?
>
>

-- 
Ted Dunning, CTO
DeepDyve

Re: HadoopDB and similar stuff

Posted by CubicDesign <cu...@gmail.com>.

> What kind of features are you looking for?
>   

 Hi.

We want to use Hadoop (Streaming) to run some tools to process over 1 
million entries per job. Each tool will output one string, so we will 
also have 1 million outputs. Each string (probably 5KB to 50KB long) will 
be parsed, and this parsing will yield about 25-30 columns. There 
may be several jobs per day.

We need to collect the output of these tools and store it somewhere for 
later analysis. The results of one job need to be kept together, as in one 
table.

So, we need a DB that can store over one million rows (hmm... or 
columns?) per table and support some nice (SQL) queries. A 
Hadoop-oriented DB would be nice because it can store data safely (fault 
tolerant) and, because it is distributed, we won't have bottlenecks like 
the ones we have with the current MySQL DB.
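
A rough sketch of the parsing step described above, written the way a Hadoop Streaming mapper usually works: read one record per line, write one tab-separated row per line. The `key=value;key=value` input format, the three column names, and the helper names `parse_record` and `map_lines` are all assumptions made for illustration; the real tool output and its 25-30 columns would need their own parser.

```python
# Hypothetical column names; the real output has about 25-30 fields.
COLUMNS = ["id", "status", "score"]

def parse_record(line):
    """Parse 'key=value;key=value' (an assumed format) into an ordered row."""
    fields = dict(
        pair.split("=", 1) for pair in line.strip().split(";") if "=" in pair
    )
    # Missing fields become empty strings so every row has the same width.
    return [fields.get(col, "") for col in COLUMNS]

def map_lines(lines):
    """Yield one tab-separated row per non-empty input line."""
    for line in lines:
        if line.strip():
            yield "\t".join(parse_record(line))

sample = ["id=7;status=ok;score=0.5", "id=8;status=fail"]
for out in map_lines(sample):
    print(out)
# In a Streaming job the loop would iterate over sys.stdin instead of `sample`.
```

Keeping every row the same width, with empty strings for missing fields, makes the output easier to load into whatever table store is eventually chosen.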

Re: HadoopDB and similar stuff

Posted by Amandeep Khurana <am...@gmail.com>.

HadoopDB is not a DB on top of Hadoop. It's more like doing map-reduce over
database instances rather than HDFS...
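
As a toy illustration of "map-reduce over database instances", the sketch below treats two in-memory SQLite databases as nodes, runs a local SQL aggregate on each (the map step), and merges the partial results (the reduce step). This is not HadoopDB's actual implementation or API; the `hits` table, the data, and the function names are hypothetical.

```python
import sqlite3
from collections import Counter

def make_node(rows):
    """Create one in-memory database standing in for a node."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hits (page TEXT, n INTEGER)")
    conn.executemany("INSERT INTO hits VALUES (?, ?)", rows)
    return conn

nodes = [
    make_node([("home", 3), ("about", 1)]),
    make_node([("home", 2), ("docs", 5)]),
]

def map_phase(conn):
    # Each node runs SQL locally and returns its partial aggregates.
    return conn.execute("SELECT page, SUM(n) FROM hits GROUP BY page").fetchall()

def reduce_phase(partials):
    # Merge the per-node partial sums into global totals.
    totals = Counter()
    for part in partials:
        for page, n in part:
            totals[page] += n
    return dict(totals)

totals = reduce_phase(map_phase(conn) for conn in nodes)
print(totals)
```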

HBase is the most stable structured storage layer available over Hadoop.

What kind of features are you looking for?

On Mon, Sep 14, 2009 at 2:03 PM, CubicDesign <cu...@gmail.com> wrote:

> Hi.
>
> Anybody has experience a DB that can handle large amounts of data on top of
> Hadoop?
> HBase and Hive is nice but they also lack of some features. HadoopDB seems
> to bring some equilibrium. However, it seems to be still an infant project.
>
> Any thoughts?
>
>

RE: HadoopDB and similar stuff

Posted by Omer Trajman <om...@vertica.com>.

The closest thing that's stable may be DBInputFormat, which allows you
to Map/Reduce on data that's in a database and also query the same
database via the native SQL interface.  In this case the DB sits under
or next to hadoop.

[shameless-plug] Vertica has an optimized VerticaInput/OutputFormat
based on DBInputFormat that can handle large amounts of data
[/shameless-plug]

-----Original Message-----
From: CubicDesign [mailto:cubicdesign@gmail.com] 
Sent: Monday, September 14, 2009 5:04 PM
To: common-user@hadoop.apache.org
Subject: HadoopDB and similar stuff

Hi.

Anybody has experience a DB that can handle large amounts of data on top
of Hadoop?
HBase and Hive is nice but they also lack of some features. HadoopDB 
seems to bring some equilibrium. However, it seems to be still an infant
project.

Any thoughts?