Posted to user@hbase.apache.org by Greg Cottman <gr...@quest.com> on 2009/08/18 10:13:23 UTC

Public HBase data store?

Hi all,

I need to do some scalability testing of an HBase query tool.  We have just started using HBase and sadly do not have an existing database against which to test.  One thing we are interested in exploring is the difference between an index table strategy and map/reduce queries without indexes.

I realise this is a long shot and that queries are very data-dependent, but...  Are there any publicly accessible HBase stores or reference sites against which you can run test queries?

Or does everyone just create a 10 billion row test environment on their local development box?  :-)

Cheers,
Greg.

Re: Public HBase data store?

Posted by Jonathan Gray <jl...@streamy.com>.
Tim,

We do things like that, both out of search indexes and to perform simple 
"joins": one table might have an ordered list of ids together in a 
family; we grab a "page" of those ids and then perform the join by 
grabbing a set of columns from a different table, one row per id.

Yes, joins can be a dirty word, but in the cases where we do simple joins 
like this, the data would be duplicated so many times that denormalization 
is not feasible.  And in your case, actually storing the data fields in 
Lucene is extremely expensive, so it can certainly make sense.

One thing... If you are going to have a number of "get by key" calls for 
a single query/page, running them in parallel can significantly improve 
total time.  This is especially the case if the keys you need to query 
for are well dispersed across the table (so you can hit multiple 
regionservers).
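
Something like this, as a rough, untested sketch (the table and family 
names are made up, and I construct one HTable per task since HTable 
instances aren't safe to share across threads):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ParallelGets {
  // Fetch one "page" of rows by key, issuing the gets in parallel so
  // well-dispersed keys hit multiple regionservers concurrently.
  public static List<Result> fetchPage(List<byte[]> rowKeys) throws Exception {
    final HBaseConfiguration conf = new HBaseConfiguration();
    ExecutorService pool = Executors.newFixedThreadPool(10);
    try {
      List<Future<Result>> futures = new ArrayList<Future<Result>>();
      for (final byte[] key : rowKeys) {
        futures.add(pool.submit(new Callable<Result>() {
          public Result call() throws Exception {
            // One HTable per task; HTable isn't thread-safe.
            HTable table = new HTable(conf, "mytable");
            Get get = new Get(key);
            get.addFamily(Bytes.toBytes("data")); // made-up family name
            return table.get(get);
          }
        }));
      }
      List<Result> results = new ArrayList<Result>(rowKeys.size());
      for (Future<Result> f : futures) {
        results.add(f.get()); // preserves the original key order
      }
      return results;
    } finally {
      pool.shutdown();
    }
  }
}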

JG

tim robertson wrote:
> Hi Ryan,
> 
> What kind of random row lookup throughput do you get (e.g. rows per
> second) on the 10b store on the 20-machine cluster (assuming the client
> isn't saturating)?
> 
> I'm pondering indexing HBase rows in various ways with Lucene, with
> only the row key stored.  Then page over the search results and stream
> out the response (transforming to the preferred response format on the
> fly - RDF, CSV, XML etc.) by doing sequential "get by key" calls.
> Maybe a stupid idea, but I'm not sure what else can index so well.
> 
> I'm just curious...
> 
> Thanks,
> Tim
> 

Re: Public HBase data store?

Posted by tim robertson <ti...@gmail.com>.
Hi Ryan,

What kind of random row lookup throughput do you get (e.g. rows per
second) on the 10b store on the 20-machine cluster (assuming the client
isn't saturating)?

I'm pondering indexing HBase rows in various ways with Lucene, with
only the row key stored.  Then page over the search results and stream
out the response (transforming to the preferred response format on the
fly - RDF, CSV, XML etc.) by doing sequential "get by key" calls.
Maybe a stupid idea, but I'm not sure what else can index so well.
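
Roughly what I have in mind on the indexing side - a sketch only
(Lucene 2.x style; the field names are invented):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class RowKeyIndexer {
  // Index the searchable fields but store only the HBase row key;
  // the actual row data stays in HBase and is fetched by key afterwards.
  public static void index(IndexWriter writer, String rowKey, String name)
      throws Exception {
    Document doc = new Document();
    doc.add(new Field("rowkey", rowKey,
        Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("name", name,
        Field.Store.NO, Field.Index.ANALYZED));
    writer.addDocument(doc);
  }
}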

I'm just curious...

Thanks,
Tim




On Tue, Aug 18, 2009 at 10:07 PM, Ryan Rawson<ry...@gmail.com> wrote:
> I run real machines; they aren't too expensive and are substantially
> more performant than the virtualized servers EC2 offers. I have 10b
> rows loaded on 20 machines, but you could probably do that on 10 or
> so. Don't forget that 10b rows would require a $40,000 machine to run
> MySQL, so why not spend the $40,000 on a cluster?

Re: Public HBase data store?

Posted by Ryan Rawson <ry...@gmail.com>.
I run real machines; they aren't too expensive and are substantially
more performant than the virtualized servers EC2 offers. I have 10b
rows loaded on 20 machines, but you could probably do that on 10 or
so. Don't forget that 10b rows would require a $40,000 machine to run
MySQL, so why not spend the $40,000 on a cluster?

On Tue, Aug 18, 2009 at 12:20 PM, Jonathan Gray<jl...@streamy.com> wrote:
> I have a little util I created called HBench.  You can customize the
> different parameters to generate data of varying sizes/patterns/etc.
>
> https://issues.apache.org/jira/browse/HBASE-1501
>
> JG

Re: Public HBase data store?

Posted by Jonathan Gray <jl...@streamy.com>.
I have a little util I created called HBench.  You can customize the 
different parameters to generate data of varying sizes/patterns/etc.

https://issues.apache.org/jira/browse/HBASE-1501

JG

Andrew Purtell wrote:
> Most that I am aware of set up transient test environments on EC2.
> 
> You can use one instance to create an EBS volume containing all software
> and config you need, then snapshot it, then clone volumes based on the
> snapshot to attach to any number of instances you need. Use X-Large 
> instances, at least 4. Give HBase regionservers a 2GB heap. Then try your
> 10 billion row test case.
> 
>    - Andy

Re: Public HBase data store?

Posted by Andrew Purtell <ap...@apache.org>.
Most that I am aware of set up transient test environments on EC2.

You can use one instance to create an EBS volume containing all software
and config you need, then snapshot it, then clone volumes based on the
snapshot to attach to any number of instances you need. Use X-Large 
instances, at least 4. Give HBase regionservers a 2GB heap. Then try your
10 billion row test case.
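
For the regionserver heap, that is a one-line change in hbase-env.sh on
each node (HBASE_HEAPSIZE is in MB):

# conf/hbase-env.sh
export HBASE_HEAPSIZE=2000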

   - Andy