Posted to user@hbase.apache.org by Billy Pearson <sa...@pearsonwholesale.com> on 2009/03/27 07:20:57 UTC

no memory tables

I was wondering if anyone else out there would like to use hbase to store
data that does not need random access, just insert/delete/scan.
If we could support a table like this, it would require little to no memory
but still allow sorted, scannable, updateable data to be stored in hbase
without the need to keep an index of keys in memory.
We would still use memory for inserts held in memcache, but there would be no
key index in memory.

This would allow large datasets that do not need random access to be stored
while still giving access to new/live data with scans, without having to
merge/sort the data on disk manually before seeing updates.

I have a large amount of data coming in that needs to be expired over time. I
store it in hadoop and run MR jobs over it to produce an accessible index of
the data via hbase.
The idea here is that if I could import that data into hbase, I could access
subsets of the data without having to read all of it to find what I am
looking for.
With this, hbase could merge/sort/expire/split the data as needed and still
give access to newly inserted data.

This might take some memory on the master node, but I would not think there
would be a limit on the size of the data except the hadoop storage size.
Anyone else think they could use something like this?

Billy Pearson
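
The access pattern being described (inserts, deletes, and sequential scans,
never a random get) looks roughly like this with the 0.20-era Java client;
the table name, row keys, and values below are made up for illustration:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class InsertDeleteScanOnly {
    public static void main(String[] args) throws Exception {
        // "events" is a hypothetical table with one family, "default".
        HTable table = new HTable(new HBaseConfiguration(), "events");

        // Insert: new data is only ever appended.
        Put put = new Put(Bytes.toBytes("row-0001"));
        put.add(Bytes.toBytes("default"), Bytes.toBytes("0"), Bytes.toBytes("some value"));
        table.put(put);

        // Delete: old data is expired by key.
        table.delete(new Delete(Bytes.toBytes("row-0000")));

        // Scan: read a sorted subset sequentially from a start key.
        ResultScanner scanner = table.getScanner(new Scan(Bytes.toBytes("row-0001")));
        for (Result r : scanner) {
            // process each row in key order ...
        }
        scanner.close();
    }
}

Nothing in this workload ever asks a region server to seek to an arbitrary
key, which is what the in-memory key index exists to make fast.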




Re: no memory tables

Posted by Billy Pearson <bi...@sbcglobal.net>.

Yes, I agree hfile seems to be better on memory and speed right now.
I would still like to see something like a scan-only flag, along the lines of
the read-only flag we already have: the index could still be created, but
with the scan-only flag turned on it would not be loaded into memory.

Billy
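
As a concrete sketch of that idea: HTableDescriptor already exposes per-table
attributes (the existing read-only flag mentioned above, plus arbitrary
metadata via setValue), so a scan-only table could plausibly be requested the
same way. The SCAN_ONLY attribute below is purely hypothetical and is not
honored by any HBase release:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanOnlyTableSketch {
    public static void main(String[] args) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());

        // "events" is a hypothetical table with one family, "default".
        HTableDescriptor desc = new HTableDescriptor("events");
        desc.addFamily(new HColumnDescriptor(Bytes.toBytes("default")));

        // HTableDescriptor already carries a per-table read-only flag
        // (setReadOnly(true)); the proposal is a similar flag. With the
        // hypothetical SCAN_ONLY attribute set, the key index would still be
        // written at flush/compaction time but never loaded into region
        // server memory, so the table would support insert/delete/scan but
        // no random gets.
        desc.setValue(Bytes.toBytes("SCAN_ONLY"), Bytes.toBytes("true"));

        admin.createTable(desc);
    }
}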




----- Original Message ----- 
From: "Ryan Rawson" <ry...@public.gmane.org>
Newsgroups: gmane.comp.java.hadoop.hbase.user
To: <hb...@public.gmane.org>
Sent: Friday, March 27, 2009 1:31 AM
Subject: Re: no memory tables


> Hey,
>
> Interesting ideas - there are some features in 0.20 that might obviate the
> need for some of the suggestions below...
>
> One major problem with hbase 0.19 is the indexing scheme - an index entry
> is created every 128 entries.  With large data sets with small key/values,
> this is a major problem.
>
> But in hbase 0.20, the index is now based on blocks.  On my own test:
> - 1 hfile that is 161 MB on disk
> - contains 11m key/values
> - represents about 5.5 million rows
> - 3.7x compression
> - default block size (pre-compression) of 64kBytes
> - in-memory block index size: 770kBytes.
>
> One problem with 0.19 is the size of in-memory indexes... With hfile in
> 0.20 we will have far fewer problems.
>
>
> On Thu, Mar 26, 2009 at 11:20 PM, Billy Pearson
> <sa...@public.gmane.org> wrote:
>
>> I was wondering if anyone else out there would like to use hbase to store
>> data that does not need random access, just insert/delete/scan.
>> If we could support a table like this, it would require little to no memory
>> but still allow sorted, scannable, updateable data to be stored in hbase
>> without the need to keep an index of keys in memory.
>> We would still use memory for inserts held in memcache, but there would be
>> no key index in memory.
>>
>> This would allow large datasets that do not need random access to be stored
>> while still giving access to new/live data with scans, without having to
>> merge/sort the data on disk manually before seeing updates.
>>
>> I have a large amount of data coming in that needs to be expired over time.
>> I store it in hadoop and run MR jobs over it to produce an accessible index
>> of the data via hbase.
>> The idea here is that if I could import that data into hbase, I could
>> access subsets of the data without having to read all of it to find what I
>> am looking for.
>> With this, hbase could merge/sort/expire/split the data as needed and still
>> give access to newly inserted data.
>>
>> This might take some memory on the master node, but I would not think there
>> would be a limit on the size of the data except the hadoop storage size.
>> Anyone else think they could use something like this?
>>
>> Billy Pearson
>>
>>
>>
>>
> 



Re: no memory tables

Posted by Andrew Purtell <ap...@apache.org>.
Regrettably I no longer control that 25-node cluster, a
consequence of a promotion. Call me a victim of my own 
success with it. But actually I am up on AWS right now
finishing up a 64-bit AMI and EBS snapshot I can use to stand
up HBase test clusters on demand. 

I'll be reinitializing at every experiment start anyhow, so
hfile format changes won't be a problem at all.

   - Andy

> From: Ryan Rawson
> Subject: Re: no memory tables
> To: hbase-user@hadoop.apache.org, apurtell@apache.org
> Date: Friday, March 27, 2009, 1:28 AM
> Trunk is workable right now - but ymmv and there is no, and
> I repeat, NO guarantee the file format won't change at a
> moment's notice.  I mean it - you could svn up and your
> installation will be trashed, with no way to
> recover it except rm -rf /hbase.
> 
> In fact, let me stress again - there is an outstanding
> patch that will in fact change the basic storage format (key
> format in hfile to be specific).
> 
> BUT
> 
> If you are willing to toss test data in a throw-away
> instance, give it a spin.  Watch as your 25-node cluster
> sustains 200-300k ops/sec for hours on end (turn compression
> on to 'gz').  Be amazed as scanners return 0 rows in
> 0ms from the client's point of view.  And so on.



      

Re: no memory tables

Posted by Ryan Rawson <ry...@gmail.com>.
Trunk is workable right now - but ymmv and there is no, and I repeat, NO
guarantee the file format won't change at a moment's notice.  I mean it -
you could svn up and your installation will be trashed, with no way to
recover it except rm -rf /hbase.

In fact, let me stress again - there is an outstanding patch that will
in fact change the basic storage format (key format in hfile to be
specific).

BUT

If you are willing to toss test data in a throw-away instance, give it a
spin.  Watch as your 25-node cluster sustains 200-300k ops/sec for hours on
end (turn compression on to 'gz').  Be amazed as scanners return 0 rows in
0ms from the client's point of view.  And so on.

Last piece, on my data from earlier: I forgot to mention that the rowid size
is 16 bytes, and the data varies, but is probably no more than 20-30 bytes or
so.  Column names are 'default:0', so total size per row is roughly
16 + 20-30 + 8 + 10 = 54-64 bytes.



On Fri, Mar 27, 2009 at 1:14 AM, Andrew Purtell <ap...@apache.org> wrote:

>
> I'm really looking forward to taking HFile for a spin. Thanks so
> much for your contributions, Ryan.
>
>  - Andy
>
> > From: Ryan Rawson <ry...@gmail.com>
> > Subject: Re: no memory tables
> > To: hbase-user@hadoop.apache.org
> > Date: Thursday, March 26, 2009, 11:31 PM
> > Hey,
> >
> > Interesting ideas - there are some features in 0.20 that
> > might obviate the need for some of the suggestions below...
> >
> > One major problem with hbase 0.19 is the indexing scheme -
> > an index entry is created every 128 entries.  With large
> > data sets with small key/values, this is a major problem.
> >
> > But in hbase 0.20, the index is now based on blocks.  On my
> > own test:
> > - 1 hfile that is 161 MB on disk
> > - contains 11m key/values
> > - represents about 5.5 million rows
> > - 3.7x compression
> > - default block size (pre-compression) of 64kBytes
> > - in-memory block index size: 770kBytes.
> >
> > One problem with 0.19 is the size of in-memory indexes...
> > With hfile in 0.20 we will have far fewer problems.
>
>
>
>
>

Re: no memory tables

Posted by Andrew Purtell <ap...@apache.org>.
I'm really looking forward to taking HFile for a spin. Thanks so
much for your contributions, Ryan. 

  - Andy

> From: Ryan Rawson <ry...@gmail.com>
> Subject: Re: no memory tables
> To: hbase-user@hadoop.apache.org
> Date: Thursday, March 26, 2009, 11:31 PM
> Hey,
> 
> Interesting ideas - there are some features in 0.20 that
> might obviate the need for some of the suggestions below...
> 
> One major problem with hbase 0.19 is the indexing scheme -
> an index entry is created every 128 entries.  With large
> data sets with small key/values, this is a major problem.
> 
> But in hbase 0.20, the index is now based on blocks.  On my
> own test:
> - 1 hfile that is 161 MB on disk
> - contains 11m key/values
> - represents about 5.5 million rows
> - 3.7x compression
> - default block size (pre-compression) of 64kBytes
> - in-memory block index size: 770kBytes.
> 
> One problem with 0.19 is the size of in-memory indexes...
> With hfile in 0.20 we will have far fewer problems.



      

Re: no memory tables

Posted by Ryan Rawson <ry...@gmail.com>.
Hey,

Interesting ideas - there are some features in 0.20 that might obviate the
need for some of the suggestions below...

One major problem with hbase 0.19 is the indexing scheme - an index entry is
created every 128 entries.  With large data sets with small key/values, this
is a major problem.

But in hbase 0.20, the index is now based on blocks.  On my own test:
- 1 hfile that is 161 MB on disk
- contains 11m key/values
- represents about 5.5 million rows
- 3.7x compression
- default block size (pre-compression) of 64kBytes
- in-memory block index size: 770kBytes.

One problem with 0.19 is the size of in-memory indexes... With hfile in 0.20
we will have far fewer problems.
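
As a back-of-the-envelope check of those figures, assuming one block-index
entry per 64 kB uncompressed block in 0.20 and one key-index entry per 128
key/values in 0.19 (the derived counts are estimates, not measurements):

public class HFileIndexEstimate {
    public static void main(String[] args) {
        long onDiskBytes   = 161L * 1024 * 1024; // hfile size on disk
        double compression = 3.7;                // reported compression ratio
        long blockSize     = 64 * 1024;          // default pre-compression block size
        long keyValues     = 11000000L;          // ~11m key/values in the file

        // 0.20: one block-index entry per uncompressed block.
        double blocks020 = (onDiskBytes * compression) / blockSize;   // ~9,500
        // 0.19: one key-index entry per 128 key/values.
        double entries019 = keyValues / 128.0;                        // ~86,000

        System.out.printf("0.20 block-index entries: ~%.0f%n", blocks020);
        System.out.printf("0.19 key-index entries:   ~%.0f%n", entries019);
        System.out.printf("bytes per 0.20 index entry: ~%.0f%n",
            770.0 * 1024 / blocks020);
    }
}

Roughly 9,500 block-index entries at about 80 bytes each lines up with the
reported 770 kB index, versus on the order of 86,000 entries under the 0.19
scheme for the same file.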


On Thu, Mar 26, 2009 at 11:20 PM, Billy Pearson
<sa...@pearsonwholesale.com> wrote:

> I was wondering if anyone else out there would like to use hbase to store
> data that does not need random access, just insert/delete/scan.
> If we could support a table like this, it would require little to no memory
> but still allow sorted, scannable, updateable data to be stored in hbase
> without the need to keep an index of keys in memory.
> We would still use memory for inserts held in memcache, but there would be
> no key index in memory.
>
> This would allow large datasets that do not need random access to be stored
> while still giving access to new/live data with scans, without having to
> merge/sort the data on disk manually before seeing updates.
>
> I have a large amount of data coming in that needs to be expired over time.
> I store it in hadoop and run MR jobs over it to produce an accessible index
> of the data via hbase.
> The idea here is that if I could import that data into hbase, I could access
> subsets of the data without having to read all of it to find what I am
> looking for.
> With this, hbase could merge/sort/expire/split the data as needed and still
> give access to newly inserted data.
>
> This might take some memory on the master node, but I would not think there
> would be a limit on the size of the data except the hadoop storage size.
> Anyone else think they could use something like this?
>
> Billy Pearson
>
>
>
>