Posted to user@hbase.apache.org by Andrew Purtell <ap...@apache.org> on 2009/08/25 12:17:12 UTC

HBase mention in VLDB keynote

In this keynote address here at VLDB 2009 (http://vldb2009.org/?q=node/22) Raghu Ramakrishnan, Yahoo! Research's Chief Scientist, made prominent mention of HBase, much to my surprise (and later chagrin). This happened near the end of the talk, when a number of the new elastic/scalable/"nosql" storage systems were discussed to make concrete some of the architectural and data model points made earlier. The alternatives considered were Yahoo's PNUTS, sharded MySQL, HBase, and Cassandra. I don't know exactly what version of HBase was used, but unfortunately the message was "not ready yet". Perhaps it was a configuration or provisioning issue, but HBase did not really survive the evaluation, leading to short hyperbolic performance curves terminating on the far left of the various graphs. This was quite disappointing to see, as the other alternatives were apparently successfully tested on what can be presumed to be the same resources. It stands to reason there is opportunity for HBase to improve here, if only we knew what it is. It was also a little disappointing that, judging by a mailing list search, these issues were not brought to either hbase-dev@ or hbase-users@; only a minor question relating to the REST interface was. Perhaps the community could have identified a specific configuration problem, recommended a correction for a deployment/provisioning error, or resolved a bug. To future evaluators of HBase: on behalf of the community I humbly request that you share your results, good or bad, so we can take the feedback, or the bug reports and their artifacts (logs, etc.), and improve our software.

At least, the story has already changed from what was presented today -- for example, the multimaster architecture of 0.20 was not presented, rather the older one (circa 0.19); and JG's/Ryan's performance test results for 0.20 stand as a contradiction. We should look into opportunities to produce a peer reviewed positive contribution. I think we have opportunities to take some novel approaches in the system itself and/or produce a novel vertical contribution and 0.20 is a good substrate for that.

Though this was unfortunately a missed opportunity for a good showing for HBase in particular, the keynote in general was a well formulated introduction of the emerging area of "cloud scale" storage / "nosql" systems to the largest elite gathering of database and data processing researchers in the world. Importantly, the presentation was also a call for participation in the future development and directions of the new and growing "nosql" constellation. Such participation, whether it is specific involvement with the HBase project or not, would be and is most welcome, as serving data at very large scale under "cloud" constraints is an area of both significant challenge and significant promise. HBase, like other projects in this area, is in an early stage of development. These projects cover the use cases of their creators but are not yet answers to the larger set of problems -- that space is untapped and only waiting for creativity and effort. I think I can speak for HBase in particular: we welcome this and would be pleased to assist at every opportunity.

- Andy

Re: HBase mention in VLDB keynote

Posted by stack <st...@duboce.net>.

On Tue, Aug 25, 2009 at 2:01 PM, Schubert Zhang <zs...@gmail.com> wrote:

> @stack
> We know HIVE-705, and already have good communication with the contributor,
> since we are all Chinese. :-)
> In fact some code from the patch is used and tested in our project. But we
> need a more flexible data store schema to resolve engineering problems,
> especially performance and practicability.

Pardon me Schubert.  I read through all of the issue last night and only
then noticed you have already made a contribution.
Good stuff,
St.Ack

Re: HBase mention in VLDB keynote

Posted by Andrew Purtell <ap...@apache.org>.

Right, the point I was making is not about absolute numbers but the scale of the test and successful results at that scale. I would think that is on par with the (failed) experimentation at Yahoo, but have yet to see the evaluation materials posted anywhere.

   - Andy





________________________________
From: Jonathan Gray <jl...@streamy.com>
To: hbase-user@hadoop.apache.org
Sent: Tuesday, August 25, 2009 11:08:17 PM
Subject: Re: HBase mention in VLDB keynote

If you are just looking for numbers, they can vary quite drastically 
depending on the cluster configuration, cluster hardware, jvm/gc 
configuration, dataset properties, read patterns, and load patterns. 
The ones I provided in that presentation are on a very small cluster but 
with simple data and low load, my attempt at getting some base numbers.

You really need to load up some of your own data and see how it behaves 
on your own cluster.  And tuning is increasingly important now as we are 
limited by Java GC quite a bit.

JG

Schubert Zhang wrote:
> @stack
> We know HIVE-705, and already have good communication with the contributor,
> since we are all Chinese. :-)
> In fact some code from the patch is used and tested in our project. But we
> need a more flexible data store schema to resolve engineering problems,
> especially performance and practicability.
> 
> @andy
> Does Ryan's result differ from JG's?
> On Wed, Aug 26, 2009 at 2:50 AM, Andrew Purtell <ap...@apache.org> wrote:
> 
>> Hi Schubert,
>>
>>
>>> Regarding "...and JG's/Ryan's performance test results for 0.20 stand as a
>>> contradiction": can you provide more references, such as a URL/link for
>>> these contradictions?
>>
>> For JG: http://www.docstoc.com/docs/7493304/HBase-Goes-Realtime
>>
>> I'm sure you have seen this already.
>>
>> Ryan has posted some information on the list now and again.
>>
>> Also I think your work with performance evaluation is very important
>> feedback and data points. Thanks for that.
>>
>>> We are doing an interesting thing to make Hive able to use HBase as its data
>>> store. Now we can use Hive's SQL to query/mapreduce data stored in HBase,
>>> and we can also directly query/scan data from HBase.
>>
>> That sounds REALLY interesting!
>>
>>   - Andy
>>
>>
>>
>>
>> ________________________________
>> From: Schubert Zhang <zs...@gmail.com>
>> To: hbase-user@hadoop.apache.org
>> Sent: Tuesday, August 25, 2009 8:26:50 PM
>>  Subject: Re: HBase mention in VLDB keynote
>>
>> hi andy,
>>
>> Even though current HBase is not yet ready for production, we know it is
>> really testable and evaluable for its data model and architecture.
>>
>> Regarding "...and JG's/Ryan's performance test results for 0.20 stand as a
>> contradiction": can you provide more references, such as a URL/link for
>> these contradictions?
>>
>> Regarding Hive, it's really a good design, especially its abstraction of
>> MapReduce workflow mapped to SQL. Hive has been a success inside Facebook;
>> the report says 29% of Facebook employees use Hive, and 51% of those users
>> are from outside engineering. That is probably because SQL is easier to
>> learn than other languages such as Pig Latin. In fact, Pig is now adding
>> metadata and SQL features, which Hive already provides. But Hive is still
>> not very flexible about using data stores other than HDFS files. We are
>> doing an interesting thing to make Hive able to use HBase as its data
>> store. Now we can use Hive's SQL to query/mapreduce data stored in HBase,
>> and we can also directly query/scan data from HBase.
>>
>> I believe HBase can work as a storage adapter layer above HDFS. It is not
>> a database; it is just a data storage adapter system above HDFS, with a
>> distributed b-tree clustered index. BigTable is designed to provide an
>> easier way to store small data objects with random access, since GFS is
>> designed for sequential-access/batch-processing/large-data storage and is
>> not appropriate for storing small data objects or for random access.
>>
>> I also believe HBase can be a data store that makes MapReduce over HBase
>> possible. If we review the Bigtable paper, especially section 8, we find
>> it is widely used for mapreduce analysis/summary in many Google
>> applications.
>>
>>
>> In the recent ACM Queue interview with Sean Quinlan, Google GFS leader, we
>> find that Google's new GFS integrates some data models of Bigtable.
>> http://queue.acm.org/detail.cfm?id=1594206
>>
>>
>> Schubert
>>
>> On Wed, Aug 26, 2009 at 12:36 AM, Bradford Stephens <
>> bradfordstephens@gmail.com> wrote:
>>
>>> Interesting. I need to see what sort of eval was going on for that
>>> presentation...
>>>
>>> He probably forgot to tweak GC :)
>>>
>>> On Tue, Aug 25, 2009 at 9:32 AM, Andrew Purtell <ap...@apache.org>
>>> wrote:
>>>
>>>>> Can we write him to figure more on how evaluation was done?
>>>>
>>>> This was one interaction with that group, maybe the only other one aside
>>>> from a question about sizing memstore:
>>>> http://osdir.com/ml/hbase-user-hadoop-apache/2009-07/msg00552.html
>>>> Now I wonder if the eval was done via the REST gateway... A followup might
>>>> be useful. If I run into someone from Yahoo Research here I'll ask.
>>>> Otherwise we should try mailing them, yes.
>>>>
>>>>> Should we try and get into VLDB next year?
>>>> We can certainly submit a candidate paper given a novel contribution of
>>>> some kind which moves the state of the art forward. There are other venues
>>>> besides VLDB we can also consider. Regardless, I think one of us should
>>>> attend VLDB every year.
>>>>
>>>>> Any thing else interesting at the conference?
>>>> Yes.
>>>>
>>>> ETH Zurich presented a system which tailors consistency to the needs of
>>>> various data items -- "consistency rationing in the cloud: pay only when it
>>>> matters" -- choosing eventual (session) consistency or pessimistic 2PC on
>>>> demand according to a cost model, with good results. Made me think of
>>>> possibilities with THBase. Also, I watched a demo of HIVE, something I
>>>> hadn't seen to date. Their query planner and mapreduce scheduler are
>>>> interesting in concept and in detail. We're looking at Cascading for batch
>>>> analytics on top of HBase instead, but knowing more about alternatives is
>>>> always good.
>>>>
>>>> The Hadoop-y track is really tomorrow.
>>>>
>>>> Outside of direct relevance to things HBase I attended talks on aspects of
>>>> data fusion, ETL, and complex event processing / stream processing, wearing
>>>> my TM hat. Lots of good stuff here.
>>>>
>>>>   - Andy
>>>>
>>>>
>>>>
>>>> ________________________________
>>>> From: Stack <sa...@gmail.com>
>>>> To: "hbase-user@hadoop.apache.org" <hb...@hadoop.apache.org>
>>>> Sent: Tuesday, August 25, 2009 4:47:57 PM
>>>> Subject: Re: HBase mention in VLDB keynote
>>>>
>>>> The same fella did keynote at apachecon eu on a similar topic.  Then he
>>>> talked mostly of Sherpa/pnuts yahoo tech.   In that presentation we got no
>>>> mention.  There the comparison strangely was to couchdb and perhaps
>>>> Cassandra (iirc).
>>>>
>>>> So, mention is an improvement (do you think the kick up the behind I
>>>> rendered him after his amsterdam talk could have had anything to do with
>>>> it?).
>>>>
>>>> Can we write him to figure more on how evaluation was done?
>>>>
>>>> Should we try and get into vldb next year?
>>>>
>>>> Good stuff Andy.  Any thing else interesting at the conference?
>>>>
>>>> Stack
>>>
>>>
>>> --
>>> http://www.roadtofailure.com -- The Fringes of Scalability, Social Media,
>>> and Computer Science
>>>
>>
>>
>>
>>
>

Re: HBase mention in VLDB keynote

Posted by stack <st...@duboce.net>.

On Tue, Aug 25, 2009 at 7:05 PM, Schubert Zhang <zs...@gmail.com> wrote:

> Thanks JG. We are trying to load up our datasets now.  But one thing is for
> sure: the cluster becomes slower as the dataset grows larger and larger.
> This is most noticeable on writes and random reads.

What kind of sizes are you talking about, Schubert, and can you figure out
where the slowdown is?
St.Ack

Re: HBase mention in VLDB keynote

Posted by Schubert Zhang <zs...@gmail.com>.

Thanks JG. We are trying to load up our datasets now.  But one thing is for
sure: the cluster becomes slower as the dataset grows larger and larger.
This is most noticeable on writes and random reads.


On Wed, Aug 26, 2009 at 5:08 AM, Jonathan Gray <jl...@streamy.com> wrote:

> If you are just looking for numbers, they can vary quite drastically
> depending on the cluster configuration, cluster hardware, jvm/gc
> configuration, dataset properties, read patterns, and load patterns. The
> ones I provided in that presentation are on a very small cluster but with
> simple data and low load, my attempt at getting some base numbers.
>
> You really need to load up some of your own data and see how it behaves on
> your own cluster.  And tuning is increasingly important now as we are
> limited by Java GC quite a bit.
>
> JG

Re: HBase mention in VLDB keynote

Posted by Jonathan Gray <jl...@streamy.com>.

If you are just looking for numbers, they can vary quite drastically 
depending on the cluster configuration, cluster hardware, jvm/gc 
configuration, dataset properties, read patterns, and load patterns. 
The ones I provided in that presentation are on a very small cluster but 
with simple data and low load, my attempt at getting some base numbers.

You really need to load up some of your own data and see how it behaves 
on your own cluster.  And tuning is increasingly important now as we are 
limited by Java GC quite a bit.

JG

Schubert Zhang wrote:
> @stack
> We know HIVE-705, and already have good communication with the contributor,
> since we are all Chinese. :-)
> In fact some code from the patch is used and tested in our project. But we
> need a more flexible data store schema to resolve engineering problems,
> especially performance and practicability.
> 
> @andy
> Does Ryan's result differ from JG's?
> On Wed, Aug 26, 2009 at 2:50 AM, Andrew Purtell <ap...@apache.org> wrote:
> 
>> Hi Schubert,
>>
>>
>>> Regarding "...and JG's/Ryan's performance test results for 0.20 stand as a
>>> contradiction": can you provide more references, such as a URL/link for
>>> these contradictions?
>>
>> For JG: http://www.docstoc.com/docs/7493304/HBase-Goes-Realtime
>>
>> I'm sure you have seen this already.
>>
>> Ryan has posted some information on the list now and again.
>>
>> Also I think your work with performance evaluation is very important
>> feedback and data points. Thanks for that.
>>
>>> We are doing an interesting thing to make Hive able to use HBase as its data
>>> store. Now we can use Hive's SQL to query/mapreduce data stored in HBase,
>>> and we can also directly query/scan data from HBase.
>>
>> That sounds REALLY interesting!
>>
>>   - Andy
>>
>>
>>
>>
>> ________________________________
>> From: Schubert Zhang <zs...@gmail.com>
>> To: hbase-user@hadoop.apache.org
>> Sent: Tuesday, August 25, 2009 8:26:50 PM
>>  Subject: Re: HBase mention in VLDB keynote
>>
>> hi andy,
>>
>> Even though current HBase is not yet ready for production, we know it is
>> really testable and evaluable for its data model and architecture.
>>
>> Regarding "...and JG's/Ryan's performance test results for 0.20 stand as a
>> contradiction": can you provide more references, such as a URL/link for
>> these contradictions?
>>
>> Regarding Hive, it's really a good design, especially its abstraction of
>> MapReduce workflow mapped to SQL. Hive has been a success inside Facebook;
>> the report says 29% of Facebook employees use Hive, and 51% of those users
>> are from outside engineering. That is probably because SQL is easier to
>> learn than other languages such as Pig Latin. In fact, Pig is now adding
>> metadata and SQL features, which Hive already provides. But Hive is still
>> not very flexible about using data stores other than HDFS files. We are
>> doing an interesting thing to make Hive able to use HBase as its data
>> store. Now we can use Hive's SQL to query/mapreduce data stored in HBase,
>> and we can also directly query/scan data from HBase.
>>
>> I believe HBase can work as a storage adapter layer above HDFS. It is not
>> a database; it is just a data storage adapter system above HDFS, with a
>> distributed b-tree clustered index. BigTable is designed to provide an
>> easier way to store small data objects with random access, since GFS is
>> designed for sequential-access/batch-processing/large-data storage and is
>> not appropriate for storing small data objects or for random access.
>>
>> I also believe HBase can be a data store that makes MapReduce over HBase
>> possible. If we review the Bigtable paper, especially section 8, we find
>> it is widely used for mapreduce analysis/summary in many Google
>> applications.
>>
>>
>> In the recent ACM Queue interview with Sean Quinlan, Google GFS leader, we
>> find that Google's new GFS integrates some data models of Bigtable.
>> http://queue.acm.org/detail.cfm?id=1594206
>>
>>
>> Schubert
>>
>> On Wed, Aug 26, 2009 at 12:36 AM, Bradford Stephens <
>> bradfordstephens@gmail.com> wrote:
>>
>>> Interesting. I need to see what sort of eval was going on for that
>>> presentation...
>>>
>>> He probably forgot to tweak GC :)
>>>
>>> On Tue, Aug 25, 2009 at 9:32 AM, Andrew Purtell <ap...@apache.org>
>>> wrote:
>>>
>>>>> Can we write him to figure more on how evaluation was done?
>>>>
>>>> This was one interaction with that group, maybe the only other one
>> aside
>>>> from a question about sizing memstore:
>>>> http://osdir.com/ml/hbase-user-hadoop-apache/2009-07/msg00552.html
>>>> Now I wonder if the eval was done via the REST gateway... A followup
>>> might
>>>> be useful. If I run into someone from Yahoo Research here I'll ask.
>>>> Otherwise we should try mailing them, yes.
>>>>
>>>>> Should we try and get into VLDB next year?
>>>> We can certainly submit a candidate paper given a novel contribution of
>>>> some kind which moves the state of the art forward. There are other
>>> venues
>>>> besides VLDB also we can consider. Regardless, I think one of us should
>>>> attend VLDB every year.
>>>>
>>>>> Any thing else interesting at the conference?
>>>> Yes.
>>>>
>>>> ETH Zurich presented a system which tailors consistency to the needs of
>>>> various data items -- "consistency rationing in the cloud: pay only
>> when
>>> it
>>>> matters" -- choosing eventual (session) consistency or pessimistic 2PC
>> on
>>>> demand according to a cost model, with good results. Made me think of
>>>> possibilities with THBase. Also, I watched a demo of HIVE, something I
>>>> hadn't see to date. Their query planner and mapreduce scheduler is
>>>> interesting in concept and in detail. We're looking at Cascading for
>>> batch
>>>> analytics on top of HBase instead, but knowing more about alternatives
>> is
>>>> always good.
>>>>
>>>> The Hadoop-y track is really tomorrow.
>>>>
>>>> Outside of direct relevance to things HBase I attended talks on aspects
>>> of
>>>> data fusion, ETL, and complex event processing / stream processing,
>>> wearing
>>>> my TM hat. Lots of good stuff here.
>>>>
>>>>   - Andy
>>>>
>>>>
>>>>
>>>> ________________________________
>>>> From: Stack <sa...@gmail.com>
>>>> To: "hbase-user@hadoop.apache.org" <hb...@hadoop.apache.org>
>>>> Sent: Tuesday, August 25, 2009 4:47:57 PM
>>>> Subject: Re: HBase mention in VLDB keynote
>>>>
>>>> The same fella did keynote at apachecon eu on a similar topic.  Then he
>>>> talked mostly of Sherpa/pnuts yahoo tech.   In that presentation we got
>>> no
>>>> mention.  There the comparison strangely was to couchdb and perhaps
>>>> Cassandra (iirc).
>>>>
>>>> So, mention is an improvement (do you think the kick up the behind I
>>>> rendered him after his amsterdam talk could have had anything to do
>> with
>>>> it?).
>>>>
>>>> Can we write him to figure more on how evaluation was done?
>>>>
>>>> Should we try and get into vldb next year?
>>>>
>>>> Good stuff Andy.  Any thing else interesting at the conference?
>>>>
>>>> Stack
>>>>
>>>>
>>>>
>>>> On Aug 25, 2009, at 6:17 AM, Andrew Purtell <ap...@apache.org>
>> wrote:
>>>>> In this keynote address here at VLDB 2009 (
>>>> http://vldb2009.org/?q=node/22) Raghu Ramakrishnan, Yahoo! Research's
>>>> Chief Scientist, made prominent mention of HBase, much to my surprise
>>> (and
>>>> later chagrin). This happened near the end of the talk when a number of
>>> the
>>>> new elastic/scalable/"nosql" storage systems were discussed to make
>>> concrete
>>>> some of the architectural and data model points made earlier. The
>>>> alternatives considered were Yahoo's PNUTS, sharded MySQL, HBase, and
>>>> Cassandra. I don't know what version of HBase was used exactly but
>>>> unfortunately the message was "not ready yet". Perhaps it was a
>>>> configuration or provisioning issue but HBase did not really survive
>> the
>>>> evaluation, leading to short hyperbolic performance curves terminating
>> on
>>>> the far left of the various graphs. This was quite disappointing to see
>>> as
>>>> the other alternatives were apparently successfully tested on what can
>> be
>>>> presumed to be the same resources. It stands to reason there
>>>>  is opportunity for HBase to improve here if only we know what that is.
>>> It
>>>> was also a little disappointing that it appears through a mailing list
>>>> search that these issues were not brought to either hbase-dev@ or
>>>> hbase-users@, only a minor question relating to the REST interface.
>>>> Perhaps the community could have identified a specific configuration
>>>> problem, recommended a correction for a deployment/provisioning error,
>> or
>>>> resolved a bug. To future evaluators of HBase, on behalf of the
>> community
>>> I
>>>> humbly request that you share you results, good or bad, so we can take
>>> the
>>>> feedback, or the bug reports and their artifacts (logs, etc.) and
>> improve
>>>> our software.
>>>>> At least, the story has already changed from what was presented today
>>> --
>>>> for example, the multimaster architecture of 0.20 was not presented,
>>> rather
>>>> the older one (circa 0.19); and JG's/Ryan's performance test results
>> for
>>>> 0.20 stand as a contradiction. We should look into opportunities to
>>> produce
>>>> a peer reviewed positive contribution. I think we have opportunities to
>>> take
>>>> some novel approaches in the system itself and/or produce a novel
>>> vertical
>>>> contribution and 0.20 is a good substrate for that.
>>>>> Though this was unfortunately a missed opportunity for a good showing
>>> for
>>>> HBase in particular, the keynote in general was a well formulated
>>>> introduction of the emerging area of "cloud scale" storage / "nosql"
>>> systems
>>>> to the largest elite gathering of database and data processing
>>> researchers
>>>> in the world. The presentation was importantly also a call for
>>> participation
>>>> in the future development and directions of the new and growing "nosql"
>>>> constellation. Such participation, whether it is specific involvement
>>> with
>>>> the HBase project or not, would be and is most welcome as the problems
>> of
>>>> serving data at very large scale under "cloud" constraints is an area
>> of
>>>> both significant challenge and significant promise. HBase like other
>>>> projects in this area are in an early stage of development. They cover
>>> the
>>>> use cases of their creators but, as answers to the larger set of
>>> problems,
>>>> they are not -- that space is untapped and only waiting for creativity
>>> and
>>>> effort. I
>>>>  think I can speak for HBase in particular, we welcome this and would
>> be
>>>> pleased to assist at every opportunity.
>>>>>    - Andy
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> http://www.roadtofailure.com -- The Fringes of Scalability, Social
>> Media,
>>> and Computer Science
>>>
>>
>>
>>
>>
>

Re: HBase mention in VLDB keynote

Posted by Schubert Zhang <zs...@gmail.com>.

@stack
We know about HIVE-705 and are already in good communication with the
contributor, since we are all Chinese. :-)
In fact, some code from the patch is used and tested in our project. But we
need a more flexible data store schema to resolve engineering problems,
especially around performance and practicality.

@andy
Do Ryan's results differ from JG's?

Re: HBase mention in VLDB keynote

Posted by Andrew Purtell <ap...@apache.org>.

Hi Schubert,


> Regarding "...and JG's/Ryan's performance test results for 0.20 stand as a contradiction": can you provide more references, such as a URL or link to these results?

For JG: http://www.docstoc.com/docs/7493304/HBase-Goes-Realtime

I'm sure you have seen this already.

Ryan has posted some information on the list now and again.

Also, I think your performance evaluation work provides very important feedback and data points. Thanks for that.

> We are doing an interesting thing: making Hive able to use HBase as its data store. Now we can use Hive's SQL to query/mapreduce data stored in HBase, and we can also query/scan data directly from HBase.

That sounds REALLY interesting!

   - Andy





Re: HBase mention in VLDB keynote

Posted by stack <st...@duboce.net>.

On Tue, Aug 25, 2009 at 11:26 AM, Schubert Zhang <zs...@gmail.com> wrote:

> ....We are doing an interesting thing: making Hive able to use HBase as
> its data store. Now we can use Hive's SQL to query/mapreduce data stored
> in HBase, and we can also query/scan data directly from HBase.

Others are interested in this project, Schubert. See
https://issues.apache.org/jira/browse/HIVE-705.  You should add any thoughts
or progress made on this front to that issue.

St.Ack

Re: HBase mention in VLDB keynote

Posted by Schubert Zhang <zs...@gmail.com>.

Hi Andy,

Even though the current HBase is not yet ready for production, we know its
data model and architecture can certainly be tested and evaluated.

Regarding "...and JG's/Ryan's performance test results for 0.20 stand as a
contradiction": can you provide more references, such as a URL or link to
these results?

Regarding Hive, it is really a good design, especially its abstraction that
maps SQL onto MapReduce workflows. Hive has been a real success inside
Facebook: the report says 29% of Facebook employees use Hive, and 51% of
those users are from outside engineering. This is probably because SQL is
easier to learn than other languages such as Pig Latin. In fact, Pig is now
adding metadata and SQL features, which Hive already provides. But Hive is
still not very flexible about using data stores other than HDFS files. We
are doing an interesting thing: making Hive able to use HBase as its data
store. Now we can use Hive's SQL to query/mapreduce data stored in HBase,
and we can also query/scan data directly from HBase.

I believe HBase can serve as a storage adapter layer above HDFS. It is not a
database; it is just a data storage adapter system above HDFS, with a
distributed b-tree clustered index. Bigtable is designed to provide easier
ways to store small data objects with random access, since GFS is designed
for sequential-access, batch-processing, large-data storage and is not
appropriate for storing small data objects or for random access.

I also believe HBase can be a data store that makes MapReduce over HBase
possible. If we review the Bigtable paper, especially section 8, we find
that it is widely used for MapReduce analysis/summary in many Google
applications.


In the recent ACM Queue interview with Sean Quinlan, Google's GFS lead, we
can see that Google's new GFS integrates some of Bigtable's data model.
http://queue.acm.org/detail.cfm?id=1594206


Schubert

On Wed, Aug 26, 2009 at 12:36 AM, Bradford Stephens <
bradfordstephens@gmail.com> wrote:

> Interesting. I need to see what sort of eval was going on for that
> presentation...
>
> He probably forgot to tweak GC :)
>
> On Tue, Aug 25, 2009 at 9:32 AM, Andrew Purtell <ap...@apache.org>
> wrote:
>
> > > Can we write him to figure more on how evaluation was done?
> >
> >
> > This was one interaction with that group, maybe the only other one aside
> > from a question about sizing memstore:
> > http://osdir.com/ml/hbase-user-hadoop-apache/2009-07/msg00552.html
> > Now I wonder if the eval was done via the REST gateway... A followup
> might
> > be useful. If I run into someone from Yahoo Research here I'll ask.
> > Otherwise we should try mailing them, yes.
> >
> > > Should we try and get into VLDB next year?
> >
> > We can certainly submit a candidate paper given a novel contribution of
> > some kind which moves the state of the art forward. There are other
> venues
> > besides VLDB also we can consider. Regardless, I think one of us should
> > attend VLDB every year.
> >
> > > Any thing else interesting at the conference?
> >
> > Yes.
> >
> > ETH Zurich presented a system which tailors consistency to the needs of
> > various data items -- "consistency rationing in the cloud: pay only when
> it
> > matters" -- choosing eventual (session) consistency or pessimistic 2PC on
> > demand according to a cost model, with good results. Made me think of
> > possibilities with THBase. Also, I watched a demo of HIVE, something I
> > hadn't see to date. Their query planner and mapreduce scheduler is
> > interesting in concept and in detail. We're looking at Cascading for
> batch
> > analytics on top of HBase instead, but knowing more about alternatives is
> > always good.
> >
> > The Hadoop-y track is really tomorrow.
> >
> > Outside of direct relevance to things HBase I attended talks on aspects
> of
> > data fusion, ETL, and complex event processing / stream processing,
> wearing
> > my TM hat. Lots of good stuff here.
> >
> >   - Andy
> >
> >
> >
> > ________________________________
> > From: Stack <sa...@gmail.com>
> > To: "hbase-user@hadoop.apache.org" <hb...@hadoop.apache.org>
> > Sent: Tuesday, August 25, 2009 4:47:57 PM
> > Subject: Re: HBase mention in VLDB keynote
> >
> > The same fella did keynote at apachecon eu on a similar topic.  Then he
> > talked mostly of Sherpa/pnuts yahoo tech.   In that presentation we got
> no
> > mention.  There the comparison strangely was to couchdb and perhaps
> > Cassandra (iirc).
> >
> > So, mention is an improvement (do you think the kick up the behind I
> > rendered him after his amsterdam talk could have had anything to do with
> > it?).
> >
> > Can we write him to figure more on how evaluation was done?
> >
> > Should we try and get into vldb next year?
> >
> > Good stuff Andy.  Any thing else interesting at the conference?
> >
> > Stack
> >
> >
> >
> > On Aug 25, 2009, at 6:17 AM, Andrew Purtell <ap...@apache.org> wrote:
> >
> > > In this keynote address here at VLDB 2009 (
> > http://vldb2009.org/?q=node/22) Raghu Ramakrishnan, Yahoo! Research's
> > Chief Scientist, made prominent mention of HBase, much to my surprise
> (and
> > later chagrin). This happened near the end of the talk when a number of
> the
> > new elastic/scalable/"nosql" storage systems were discussed to make
> concrete
> > some of the architectural and data model points made earlier. The
> > alternatives considered were Yahoo's PNUTS, sharded MySQL, HBase, and
> > Cassandra. I don't know what version of HBase was used exactly but
> > unfortunately the message was "not ready yet". Perhaps it was a
> > configuration or provisioning issue but HBase did not really survive the
> > evaluation, leading to short hyperbolic performance curves terminating on
> > the far left of the various graphs. This was quite disappointing to see
> as
> > the other alternatives were apparently successfully tested on what can be
> > presumed to be the same resources. It stands to reason there
> >  is opportunity for HBase to improve here if only we know what that is.
> It
> > was also a little disappointing that it appears through a mailing list
> > search that these issues were not brought to either hbase-dev@ or
> > hbase-users@, only a minor question relating to the REST interface.
> > Perhaps the community could have identified a specific configuration
> > problem, recommended a correction for a deployment/provisioning error, or
> > resolved a bug. To future evaluators of HBase, on behalf of the community
> I
> > humbly request that you share you results, good or bad, so we can take
> the
> > feedback, or the bug reports and their artifacts (logs, etc.) and improve
> > our software.
> > >
> > > At least, the story has already changed from what was presented today
> --
> > for example, the multimaster architecture of 0.20 was not presented,
> rather
> > the older one (circa 0.19); and JG's/Ryan's performance test results for
> > 0.20 stand as a contradiction. We should look into opportunities to
> produce
> > a peer reviewed positive contribution. I think we have opportunities to
> take
> > some novel approaches in the system itself and/or produce a novel
> vertical
> > contribution and 0.20 is a good substrate for that.
> > >
> > > Though this was unfortunately a missed opportunity for a good showing
> for
> > HBase in particular, the keynote in general was a well formulated
> > introduction of the emerging area of "cloud scale" storage / "nosql"
> systems
> > to the largest elite gathering of database and data processing
> researchers
> > in the world. The presentation was importantly also a call for
> participation
> > in the future development and directions of the new and growing "nosql"
> > constellation. Such participation, whether it is specific involvement
> with
> > the HBase project or not, would be and is most welcome as the problems of
> > serving data at very large scale under "cloud" constraints is an area of
> > both significant challenge and significant promise. HBase like other
> > projects in this area are in an early stage of development. They cover
> the
> > use cases of their creators but, as answers to the larger set of
> problems,
> > they are not -- that space is untapped and only waiting for creativity
> and
> > effort. I
> >  think I can speak for HBase in particular, we welcome this and would be
> > pleased to assist at every opportunity.
> > >
> > >    - Andy
> > >
> > >
> >
> >
> >
> >
> >
>
>
>
> --
> http://www.roadtofailure.com -- The Fringes of Scalability, Social Media,
> and Computer Science
>

Re: HBase mention in VLDB keynote

Posted by Bradford Stephens <br...@gmail.com>.

Interesting. I need to see what sort of eval was going on for that
presentation...

He probably forgot to tweak GC :)

On Tue, Aug 25, 2009 at 9:32 AM, Andrew Purtell <ap...@apache.org> wrote:

> > Can we write him to figure more on how evaluation was done?
>
>
> This was one interaction with that group, maybe the only other one aside
> from a question about sizing memstore:
> http://osdir.com/ml/hbase-user-hadoop-apache/2009-07/msg00552.html
> Now I wonder if the eval was done via the REST gateway... A followup might
> be useful. If I run into someone from Yahoo Research here I'll ask.
> Otherwise we should try mailing them, yes.
>
> > Should we try and get into VLDB next year?
>
> We can certainly submit a candidate paper given a novel contribution of
> some kind which moves the state of the art forward. There are other venues
> besides VLDB also we can consider. Regardless, I think one of us should
> attend VLDB every year.
>
> > Any thing else interesting at the conference?
>
> Yes.
>
> ETH Zurich presented a system which tailors consistency to the needs of
> various data items -- "consistency rationing in the cloud: pay only when it
> matters" -- choosing eventual (session) consistency or pessimistic 2PC on
> demand according to a cost model, with good results. Made me think of
> possibilities with THBase. Also, I watched a demo of HIVE, something I
> hadn't see to date. Their query planner and mapreduce scheduler is
> interesting in concept and in detail. We're looking at Cascading for batch
> analytics on top of HBase instead, but knowing more about alternatives is
> always good.
>
> The Hadoop-y track is really tomorrow.
>
> Outside of direct relevance to things HBase I attended talks on aspects of
> data fusion, ETL, and complex event processing / stream processing, wearing
> my TM hat. Lots of good stuff here.
>
>   - Andy



-- 
http://www.roadtofailure.com -- The Fringes of Scalability, Social Media,
and Computer Science

Re: HBase mention in VLDB keynote

Posted by Andrew Purtell <ap...@apache.org>.

> Can we write him to figure more on how evaluation was done?

This was one interaction with that group, maybe the only other one aside from a question about sizing memstore: http://osdir.com/ml/hbase-user-hadoop-apache/2009-07/msg00552.html 
Now I wonder if the eval was done via the REST gateway... A followup might be useful. If I run into someone from Yahoo Research here I'll ask. Otherwise we should try mailing them, yes.

> Should we try and get into VLDB next year?

We can certainly submit a candidate paper given a novel contribution of some kind which moves the state of the art forward. There are also other venues besides VLDB we can consider. Regardless, I think one of us should attend VLDB every year.

> Any thing else interesting at the conference?

Yes. 

ETH Zurich presented a system which tailors consistency to the needs of various data items -- "consistency rationing in the cloud: pay only when it matters" -- choosing eventual (session) consistency or pessimistic 2PC on demand according to a cost model, with good results. Made me think of possibilities with THBase. Also, I watched a demo of HIVE, something I hadn't seen to date. Their query planner and mapreduce scheduler are interesting in concept and in detail. We're looking at Cascading for batch analytics on top of HBase instead, but knowing more about alternatives is always good.

The Hadoop-y track is really tomorrow. 

Outside of direct relevance to things HBase I attended talks on aspects of data fusion, ETL, and complex event processing / stream processing, wearing my TM hat. Lots of good stuff here.

   - Andy

________________________________
From: Stack <sa...@gmail.com>
To: "hbase-user@hadoop.apache.org" <hb...@hadoop.apache.org>
Sent: Tuesday, August 25, 2009 4:47:57 PM
Subject: Re: HBase mention in VLDB keynote

The same fella did keynote at apachecon eu on a similar topic.  Then he talked mostly of Sherpa/pnuts yahoo tech.   In that presentation we got no mention.  There the comparison strangely was to couchdb and perhaps Cassandra (iirc).

So, mention is an improvement (do you think the kick up the behind I rendered him after his amsterdam talk could have had anything to do with it?).

Can we write him to figure more on how evaluation was done?

Should we try and get into vldb next year?

Good stuff Andy.  Any thing else interesting at the conference?

Stack

On Aug 25, 2009, at 6:17 AM, Andrew Purtell <ap...@apache.org> wrote:

> In this keynote address here at VLDB 2009 (http://vldb2009.org/?q=node/22) Raghu Ramakrishnan, Yahoo! Research's Chief Scientist, made prominent mention of HBase, much to my surprise (and later chagrin). This happened near the end of the talk when a number of the new elastic/scalable/"nosql" storage systems were discussed to make concrete some of the architectural and data model points made earlier. The alternatives considered were Yahoo's PNUTS, sharded MySQL, HBase, and Cassandra. I don't know what version of HBase was used exactly but unfortunately the message was "not ready yet". Perhaps it was a configuration or provisioning issue but HBase did not really survive the evaluation, leading to short hyperbolic performance curves terminating on the far left of the various graphs. This was quite disappointing to see as the other alternatives were apparently successfully tested on what can be presumed to be the same resources. It stands to reason there is opportunity for HBase to improve here if only we know what that is. It was also a little disappointing that it appears through a mailing list search that these issues were not brought to either hbase-dev@ or hbase-users@, only a minor question relating to the REST interface. Perhaps the community could have identified a specific configuration problem, recommended a correction for a deployment/provisioning error, or resolved a bug. To future evaluators of HBase, on behalf of the community I humbly request that you share you results, good or bad, so we can take the feedback, or the bug reports and their artifacts (logs, etc.) and improve our software.
> 
> At least, the story has already changed from what was presented today -- for example, the multimaster architecture of 0.20 was not presented, rather the older one (circa 0.19); and JG's/Ryan's performance test results for 0.20 stand as a contradiction. We should look into opportunities to produce a peer reviewed positive contribution. I think we have opportunities to take some novel approaches in the system itself and/or produce a novel vertical contribution and 0.20 is a good substrate for that.
> 
> Though this was unfortunately a missed opportunity for a good showing for HBase in particular, the keynote in general was a well formulated introduction of the emerging area of "cloud scale" storage / "nosql" systems to the largest elite gathering of database and data processing researchers in the world. The presentation was importantly also a call for participation in the future development and directions of the new and growing "nosql" constellation. Such participation, whether it is specific involvement with the HBase project or not, would be and is most welcome as the problems of serving data at very large scale under "cloud" constraints is an area of both significant challenge and significant promise. HBase like other projects in this area are in an early stage of development. They cover the use cases of their creators but, as answers to the larger set of problems, they are not -- that space is untapped and only waiting for creativity and effort. I think I can speak for HBase in particular, we welcome this and would be pleased to assist at every opportunity.
> 
>    - Andy
> 
>

Re: HBase mention in VLDB keynote

Posted by Stack <sa...@gmail.com>.

The same fella did the keynote at ApacheCon EU on a similar topic.  Then
he talked mostly of Sherpa/PNUTS Yahoo tech.   In that presentation we
got no mention.  There the comparison strangely was to CouchDB and
perhaps Cassandra (iirc).

So, mention is an improvement (do you think the kick up the behind I
rendered him after his Amsterdam talk could have had anything to do
with it?).

Can we write him to figure out more about how the evaluation was done?

Should we try and get into VLDB next year?

Good stuff Andy.  Anything else interesting at the conference?

Stack



On Aug 25, 2009, at 6:17 AM, Andrew Purtell <ap...@apache.org> wrote:

> In this keynote address here at VLDB 2009 (http://vldb2009.org/?q=node/22)
> Raghu Ramakrishnan, Yahoo! Research's Chief Scientist, made
> prominent mention of HBase, much to my surprise (and later chagrin).  
> This happened near the end of the talk when a number of the new  
> elastic/scalable/"nosql" storage systems were discussed to make  
> concrete some of the architectural and data model points made  
> earlier. The alternatives considered were Yahoo's PNUTS, sharded  
> MySQL, HBase, and Cassandra. I don't know what version of HBase was  
> used exactly but unfortunately the message was "not ready yet".  
> Perhaps it was a configuration or provisioning issue but HBase did  
> not really survive the evaluation, leading to short hyperbolic  
> performance curves terminating on the far left of the various  
> graphs. This was quite disappointing to see as the other  
> alternatives were apparently successfully tested on what can be  
> presumed to be the same resources. It stands to reason there is
> opportunity for HBase to improve here if only we know what that is.  
> It was also a little disappointing that it appears through a mailing  
> list search that these issues were not brought to either hbase-dev@  
> or hbase-users@, only a minor question relating to the REST  
> interface. Perhaps the community could have identified a specific  
> configuration problem, recommended a correction for a
> deployment/provisioning error, or resolved a bug. To future evaluators of
> HBase, on behalf of the community I humbly request that you share  
> you results, good or bad, so we can take the feedback, or the bug  
> reports and their artifacts (logs, etc.) and improve our software.
>
> At least, the story has already changed from what was presented  
> today -- for example, the multimaster architecture of 0.20 was not  
> presented, rather the older one (circa 0.19); and JG's/Ryan's  
> performance test results for 0.20 stand as a contradiction. We  
> should look into opportunities to produce a peer reviewed positive  
> contribution. I think we have opportunities to take some novel  
> approaches in the system itself and/or produce a novel vertical  
> contribution and 0.20 is a good substrate for that.
>
> Though this was unfortunately a missed opportunity for a good  
> showing for HBase in particular, the keynote in general was a well  
> formulated introduction of the emerging area of "cloud scale"  
> storage / "nosql" systems to the largest elite gathering of database  
> and data processing researchers in the world. The presentation was  
> importantly also a call for participation in the future development  
> and directions of the new and growing "nosql" constellation. Such  
> participation, whether it is specific involvement with the HBase  
> project or not, would be and is most welcome as the problems of  
> serving data at very large scale under "cloud" constraints is an  
> area of both significant challenge and significant promise. HBase  
> like other projects in this area are in an early stage of  
> development. They cover the use cases of their creators but, as  
> answers to the larger set of problems, they are not -- that space is  
> untapped and only waiting for creativity and effort. I
> think I can speak for HBase in particular, we welcome this and would  
> be pleased to assist at every opportunity.
>
>    - Andy
>
>