Posted to dev@hbase.apache.org by Andrew Purtell <ap...@apache.org> on 2011/09/17 18:57:34 UTC

HBase as a large, auto-partitioned, sorted, *in-memory* database (was: Re: prefix compression implementation)

Hi Matt,

> My motivation for doing this is to make hbase a viable candidate for a

> large, auto-partitioned, sorted, *in-memory* database.  Not the usual
> analytics use case, but i think hbase would be great for this.


Really interested in hearing your thoughts as to why HBase currently is -- whether or not "viable" -- at least a suboptimal candidate for that purpose. It has been moving in the direction of being better for that purpose ever since 0.89. Where we can further improve would be a good discussion to have; as you point out, the HBase constituency is not only analytics use cases.

Best regards,

    - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)


>________________________________
>From: Matt Corgan <mc...@hotpads.com>
>To: dev@hbase.apache.org
>Sent: Friday, September 16, 2011 7:29 PM
>Subject: Re: prefix compression implementation
>
>Ryan - thanks for the feedback.  The situation I'm thinking of where it's
>useful to parse DirectBB without copying to heap is when you are serving
>small random values out of the block cache.  At HotPads, we'd like to store
>hundreds of GB of real estate listing data in memory so it can be quickly
>served up at random.  We want to access many small values that are already
>in memory, so basically skipping step 1 of 3 because values are already in
>memory.  That being said, the DirectBB are not essential for us since we
>haven't run into gb problems, i just figured it would be nice to support
>them since they seem to be important to other people.
>
>My motivation for doing this is to make hbase a viable candidate for a
>large, auto-partitioned, sorted, *in-memory* database.  Not the usual
>analytics use case, but i think hbase would be great for this.
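
For illustration, here is a minimal Java sketch of the kind of no-copy access
Matt describes: serving a small value that already sits in a direct (off-heap)
block-cache buffer by handing out a view of it rather than copying it onto the
heap. The helper name and layout are hypothetical, not HBase code.

    import java.nio.ByteBuffer;

    // Hypothetical helper: return a read-only, zero-copy view of a value that
    // already lives in a direct (off-heap) block-cache buffer.
    static ByteBuffer valueView(ByteBuffer block, int valueOffset, int valueLength) {
        ByteBuffer view = block.duplicate();        // shares the same off-heap memory
        view.position(valueOffset);
        view.limit(valueOffset + valueLength);
        return view.slice().asReadOnlyBuffer();     // caller reads without any heap copy
    }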

Re: HBase as a large, auto-partitioned, sorted, *in-memory* database (was: Re: prefix compression implementation)

Posted by Jacek Migdal <ja...@fb.com>.
My notes.

By the way, Matt, I reviewed your change. It is mostly OK; at least one
trivial change is needed.

On 9/20/11 11:23 AM, "Matt Corgan" <mc...@hotpads.com> wrote:

>inline below:
>
>On Mon, Sep 19, 2011 at 10:08 PM, Stack <st...@duboce.net> wrote:
>
>> Excellent summary Matt.  Some notes in the below.
>>
>> On Sun, Sep 18, 2011 at 6:43 PM, Matt Corgan <mc...@hotpads.com>
>>wrote:
>> > ... All of this is relatively easy for the data
>> > and index blocks because they're immutable.  Doing it for the
>>memstore is
>> > another story...
>> >
>>
>> We'd need another data structure completely, wouldn't we?  Have you
>> given it any thought?
>>
>
>i've given it some thought, but i think it can wait till after the block
>format stuff is in place.  a lot of the utility methods from that can be
>used for the memstore implementation, and there should be some interfaces
>in
>place by then too.  the memstore changes are easier than the block format
>changes in a few regards: the fact that it's not persisted anywhere and
>doesn't have to be backwards compatible.

Right now, the delta encoding integration aims to support that use case. There
will be three options per column family:
-on-disk encoding
-in-block-cache encoding
-whether it should use custom seekers

As long as on-disk encoding is set to none, there will be no change in the
block format on disk. Moreover, no encoding/decoding will be used for
compaction. There is a tiny penalty for encoding whenever a block is read
from disk, but that way it is far more flexible.

Of course, these options are currently designed primarily for testing and
development. We may want to hide or disable some of the configuration for end
users.
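
For a concrete picture, here is a hedged Java sketch of what three per-family
knobs could look like. The property key names are illustrative stand-ins, not
the names used by the actual delta-encoding patch; only HColumnDescriptor and
its generic setValue() are existing API.

    import org.apache.hadoop.hbase.HColumnDescriptor;

    // Illustrative only: three hypothetical per-column-family settings matching
    // the options listed above. The key names are made up for this sketch.
    HColumnDescriptor family = new HColumnDescriptor("familyName");
    family.setValue("DATA_BLOCK_ENCODING_ON_DISK", "NONE");    // leave the on-disk block format unchanged
    family.setValue("DATA_BLOCK_ENCODING_IN_CACHE", "PREFIX"); // encode blocks only in the block cache
    family.setValue("USE_ENCODED_SEEKER", "true");             // let reads use an encoding-aware seeker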


>
>the closest published thing i've found to what i envision is this
>paper<http://www.google.com/url?sa=t&source=web&cd=1&sqi=2&ved=0CBwQFjAA&u
>rl=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.
>85.3498%26rep%3Drep1%26type%3Dpdf&rct=j&q=cache%20efficient%20string%20sor
>ting%20with%20copying&ei=HNd4TvLQGeWqsQKt65SsDQ&usg=AFQjCNFGNRlPcu9xykCGcn
>mCkn9tEhdtLg>about
>a copying burstsort.  it builds the memstore into a trie structure,
>but where the leaves are big byte[]'s that hold many keys until they fill
>up.  then you split the leaves (burst them).  it's cpu, memory, and
>garbage
>collector friendly.
Hmm, interesting, but our use case is a bit different. We don't need a
dynamic structure. Perhaps we have different things in mind (memstore vs.
HFileBlock).

>>
>>
>> > Here's some contrived data to illustrate.  Row key is a (reversed
>> > domain) url.  Column family is called "familyName".  ColQualifier is a
>> > browser type.  Value is the number of views from that browser type.
>>  Here's
>> > the current representation <http://pastebin.com/7ks8kzJ2>, the prefix
>> trie's
>> > toString output <http://pastebin.com/cL4AkCPC> for the row keys, and
>> removing
>> > the whitespace <http://pastebin.com/4qiSXNh9> from the toString
>>output.
>>
>> That's a big diff.
Right now I do something similar. A simple algorithm which just avoids
storing the common prefix gives a very good compression ratio.
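
As a rough sketch of that kind of algorithm (not the actual patch), each
sorted key can be written as the length it shares with the previous key plus
only the non-shared suffix:

    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.util.List;

    // Sketch: write each key as (common-prefix length, suffix length, suffix bytes).
    static void writePrefixElided(List<byte[]> sortedKeys, DataOutputStream out)
            throws IOException {
        byte[] previous = new byte[0];
        for (byte[] key : sortedKeys) {
            int common = 0;
            int max = Math.min(previous.length, key.length);
            while (common < max && previous[common] == key[common]) {
                common++;
            }
            out.writeShort(common);                        // bytes shared with the previous key
            out.writeShort(key.length - common);           // length of the suffix that follows
            out.write(key, common, key.length - common);   // only the non-shared bytes
            previous = key;
        }
    }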

>>
>> > (random access inside HFileBlock)
>> >
>>
>> I'd imagine that we'd not want the index always, just when keys per
>> block went over some size.
>>
>> hfilev2 should help here because we don't load all indices all the time.
>>
>
>i dug into HFileV2 yesterday.  it's really great, solving the biggest
>problem i currently face: the problem where all indexes are held in
>memory.
> it also addresses my point about lack of random access inside blocks by
>adding a "secondary index", but only for the index blocks as far as i
>could
>tell.  adding a secondary index to the data blocks would be great, but
>it's
>not that interesting to me because i'd rather go a step further and fix
>the
>prefix compression problem at the same time

Right now, HFileV2 addresses that issue. As long as blocks are small enough,
linear seeking inside a block shouldn't be a big deal.

Potentially, prefix compression is another way to solve that problem, but
it is not my main focus.

>
>
>> > But, the prefix trie should actually be way faster than a binary
>>search
>> > because you don't have to keep comparing the beginning of the key to
>>the
>> one
>> > you're looking for.  I'll save the details for another email or for
>>code
>> > comments.  In general, with a smarter block/index format we could be
>>much
>> > more efficient than the current method of comparing full key byte[]
>>over
>> and
>> > over again.
>> >
>> > Same random access problem also applies to the block index i think
>> (correct
>> > me if i'm wrong here too).
>> >
>>
>> You should stick the above in the issue Matt, or at least refer to
>> this mail in the issue; its great stuff.
>>
>
>thanks.  i'll create some issues after i learn a little more from you guys
It seems that we have already started talking about a few different things. It
would be great if we could have specific issues for them.

Jacek


Re: HBase as a large, auto-partitioned, sorted, *in-memory* database (was: Re: prefix compression implementation)

Posted by Matt Corgan <mc...@hotpads.com>.
inline below:

On Mon, Sep 19, 2011 at 10:08 PM, Stack <st...@duboce.net> wrote:

> Excellent summary Matt.  Some notes in the below.
>
> On Sun, Sep 18, 2011 at 6:43 PM, Matt Corgan <mc...@hotpads.com> wrote:
> > ... All of this is relatively easy for the data
> > and index blocks because they're immutable.  Doing it for the memstore is
> > another story...
> >
>
> We'd need another data structure completely, wouldn't we?  Have you
> given it any thought?
>

i've given it some thought, but i think it can wait till after the block
format stuff is in place.  a lot of the utility methods from that can be
used for the memstore implementation, and there should be some interfaces in
place by then too.  the memstore changes are easier than the block format
changes in a few regards: the fact that it's not persisted anywhere and
doesn't have to be backwards compatible.

the closest published thing i've found to what i envision is this
paper<http://www.google.com/url?sa=t&source=web&cd=1&sqi=2&ved=0CBwQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.85.3498%26rep%3Drep1%26type%3Dpdf&rct=j&q=cache%20efficient%20string%20sorting%20with%20copying&ei=HNd4TvLQGeWqsQKt65SsDQ&usg=AFQjCNFGNRlPcu9xykCGcnmCkn9tEhdtLg>about
a copying burstsort.  it builds the memstore into a trie structure,
but where the leaves are big byte[]'s that hold many keys until they fill
up.  then you split the leaves (burst them).  it's cpu, memory, and garbage
collector friendly.
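
For readers who don't want to chase the paper, here is a very rough Java
sketch of the bursting idea being described: leaves buffer many keys and split
into child nodes once they grow past a threshold. A real memstore version
would pack each leaf into one big byte[] (that is what makes it GC friendly);
a List is used here only to keep the sketch short.

    import java.util.ArrayList;
    import java.util.List;

    // Rough sketch of a burst trie: interior nodes fan out on the next key byte,
    // leaves buffer whole keys until they "burst" one level deeper.
    class BurstTrieNode {
        private static final int BURST_THRESHOLD = 4096;     // keys per leaf before bursting
        private BurstTrieNode[] children;                     // non-null only after bursting
        private List<byte[]> leafKeys = new ArrayList<byte[]>();

        void add(byte[] key, int depth) {
            if (children != null && depth < key.length) {
                int b = key[depth] & 0xFF;                    // descend on the next key byte
                if (children[b] == null) children[b] = new BurstTrieNode();
                children[b].add(key, depth + 1);
                return;
            }
            leafKeys.add(key);
            if (children == null && leafKeys.size() > BURST_THRESHOLD) {
                burst(depth);
            }
        }

        private void burst(int depth) {
            children = new BurstTrieNode[256];                // create the fan-out
            List<byte[]> buffered = leafKeys;
            leafKeys = new ArrayList<byte[]>();               // keeps only keys that end exactly here
            for (byte[] key : buffered) {
                add(key, depth);                              // re-insert each key one level deeper
            }
        }
    }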

>
>
> > Here's some contrived data to illustrate.  Row key is a (reversed
> > domain) url.  Column family is called "familyName".  ColQualifier is a
> > browser type.  Value is the number of views from that browser type.
>  Here's
> > the current representation <http://pastebin.com/7ks8kzJ2>, the prefix
> trie's
> > toString output <http://pastebin.com/cL4AkCPC> for the row keys, and
> removing
> > the whitespace <http://pastebin.com/4qiSXNh9> from the toString output.
>
> That's a big diff.
>
> > The second problem is the lack of random access to cell offsets within
> the
> > data block.  (I'm not 100% sure on this one, so please correct me if i'm
> > wrong).  I noticed how bad this problem is when i was storing historical
> > event logs with 8 byte keys and small values (so ~30b per cell).  I had
> to
> > increase block size to 256KB because the block indexes were too big to
> fit
> > in memory.  Then I needed fast random access to these events.  The
> problem
> > is that there are ~10,000 cells per block, so without random lookups
> inside
> > the block, it's seeking through ~5,000 keys for the average lookup.
>  That's
> > a crazy amount of overhead to retrieve a single cell.  Probably the
> quickest
> > solution to this problem is to store the offset of each key in a list of
> > integers at the end of the block so that it's possible to do a binary
> search
> > inside the block.  That would reduce it to ~14 avg memory accesses to
> find
> > the cell.
> >
>
> I'd imagine that we'd not want the index always, just when keys per
> block went over some size.
>
> hfilev2 should help here because we don't load all indices all the time.
>

i dug into HFileV2 yesterday.  it's really great, solving the biggest
problem i currently face: the problem where all indexes are held in memory.
 it also addresses my point about lack of random access inside blocks by
adding a "secondary index", but only for the index blocks as far as i could
tell.  adding a secondary index to the data blocks would be great, but it's
not that interesting to me because i'd rather go a step further and fix the
prefix compression problem at the same time


> > But, the prefix trie should actually be way faster than a binary search
> > because you don't have to keep comparing the beginning of the key to the
> one
> > you're looking for.  I'll save the details for another email or for code
> > comments.  In general, with a smarter block/index format we could be much
> > more efficient than the current method of comparing full key byte[] over
> and
> > over again.
> >
> > Same random access problem also applies to the block index i think
> (correct
> > me if i'm wrong here too).
> >
>
> You should stick the above in the issue Matt, or at least refer to
> this mail in the issue; its great stuff.
>

thanks.  i'll create some issues after i learn a little more from you guys

>
> Thanks,
> St.Ack
>

Re: HBase as a large, auto-partitioned, sorted, *in-memory* database (was: Re: prefix compression implementation)

Posted by Stack <st...@duboce.net>.
Excellent summary Matt.  Some notes in the below.

On Sun, Sep 18, 2011 at 6:43 PM, Matt Corgan <mc...@hotpads.com> wrote:
> ... All of this is relatively easy for the data
> and index blocks because they're immutable.  Doing it for the memstore is
> another story...
>

We'd need another data structure completely, wouldn't we?  Have you
given it any thought?


> Here's some contrived data to illustrate.  Row key is a (reversed
> domain) url.  Column family is called "familyName".  ColQualifier is a
> browser type.  Value is the number of views from that browser type.  Here's
> the current representation <http://pastebin.com/7ks8kzJ2>, the prefix trie's
> toString output <http://pastebin.com/cL4AkCPC> for the row keys, and removing
> the whitespace <http://pastebin.com/4qiSXNh9> from the toString output.

That's a big diff.

> The second problem is the lack of random access to cell offsets within the
> data block.  (I'm not 100% sure on this one, so please correct me if i'm
> wrong).  I noticed how bad this problem is when i was storing historical
> event logs with 8 byte keys and small values (so ~30b per cell).  I had to
> increase block size to 256KB because the block indexes were too big to fit
> in memory.  Then I needed fast random access to these events.  The problem
> is that there are ~10,000 cells per block, so without random lookups inside
> the block, it's seeking through ~5,000 keys for the average lookup.  That's
> a crazy amount of overhead to retrieve a single cell.  Probably the quickest
> solution to this problem is to store the offset of each key in a list of
> integers at the end of the block so that it's possible to do a binary search
> inside the block.  That would reduce it to ~14 avg memory accesses to find
> the cell.
>

I'd imagine that we'd not want the index always, just when keys per
block went over some size.

hfilev2 should help here because we don't load all indices all the time.

> But, the prefix trie should actually be way faster than a binary search
> because you don't have to keep comparing the beginning of the key to the one
> you're looking for.  I'll save the details for another email or for code
> comments.  In general, with a smarter block/index format we could be much
> more efficient than the current method of comparing full key byte[] over and
> over again.
>
> Same random access problem also applies to the block index i think (correct
> me if i'm wrong here too).
>

You should stick the above in the issue Matt, or at least refer to
this mail in the issue; its great stuff.

Thanks,
St.Ack

Re: HBase as a large, auto-partitioned, sorted, *in-memory* database (was: Re: prefix compression implementation)

Posted by Matt Corgan <mc...@hotpads.com>.
I think there are 2 major issues right now, both related to the data and index
block formats.

First problem is the amount of duplicate information in the uncompressed
block format.  Taking 100GB of hot data from mysql and serving it out of
hbase might require 1-2TB of memory.  The main culprit there is the fact
that the row key is repeated for every cell.  The prefix trie not only
eliminates that penalty, but also removes the duplicate information between
consecutive row keys.  Then it builds a reverse trie of column qualifiers so
each cell can reference its qualifier with a 1-2 byte index.  Timestamps
and op types can often be eliminated, and most integers inside the cell can
be represented in 1-2 bytes.  All of this is relatively easy for the data
and index blocks because they're immutable.  Doing it for the memstore is
another story...
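
As a small illustration of just the qualifier part of that idea (not the
actual trie format), a per-block dictionary lets each cell store a 1-2 byte
index instead of repeating the full qualifier bytes:

    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of a per-block qualifier dictionary: each distinct qualifier is
    // stored once; cells reference it by a small index.
    class QualifierDictionary {
        private final Map<String, Integer> indexByQualifier = new HashMap<String, Integer>();
        private final List<byte[]> qualifiers = new ArrayList<byte[]>();

        /** Returns the index a cell would store in place of the full qualifier. */
        int indexOf(byte[] qualifier) {
            String key = new String(qualifier, StandardCharsets.ISO_8859_1);
            Integer existing = indexByQualifier.get(key);
            if (existing != null) {
                return existing;
            }
            int index = qualifiers.size();
            indexByQualifier.put(key, index);
            qualifiers.add(qualifier.clone());
            return index;
        }

        /** Reverse lookup used when decoding a cell back to its qualifier. */
        byte[] qualifierAt(int index) {
            return qualifiers.get(index);
        }
    }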

Here's some contrived data to illustrate.  Row key is a (reversed
domain) url.  Column family is called "familyName".  ColQualifier is a
browser type.  Value is the number of views from that browser type.  Here's
the current representation <http://pastebin.com/7ks8kzJ2>, the prefix trie's
toString output <http://pastebin.com/cL4AkCPC> for the row keys, and removing
the whitespace <http://pastebin.com/4qiSXNh9> from the toString output.  The
current version is really wasteful.  Another example is storing counters
with long names.  OpenTSDB does this elegantly, but with a lot of work to
get around these issues.  It would be nice to not have to do all the
workarounds.

The second problem is the lack of random access to cell offsets within the
data block.  (I'm not 100% sure on this one, so please correct me if i'm
wrong).  I noticed how bad this problem is when i was storing historical
event logs with 8 byte keys and small values (so ~30b per cell).  I had to
increase block size to 256KB because the block indexes were too big to fit
in memory.  Then I needed fast random access to these events.  The problem
is that there are ~10,000 cells per block, so without random lookups inside
the block, it's seeking through ~5,000 keys for the average lookup.  That's
a crazy amount of overhead to retrieve a single cell.  Probably the quickest
solution to this problem is to store the offset of each key in a list of
integers at the end of the block so that it's possible to do a binary search
inside the block.  That would reduce it to ~14 avg memory accesses to find
the cell.
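
To make the arithmetic concrete, here is a hedged Java sketch of that offset
list plus binary search. It assumes an illustrative layout where each cell
begins with a 2-byte key length followed by the key bytes; this is not the
real KeyValue format.

    import java.nio.ByteBuffer;

    // Sketch: 4-byte cell offsets are appended at the end of the block; a binary
    // search over them needs ~log2(10,000) ~= 14 key comparisons per lookup.
    static int findCell(ByteBuffer block, int offsetIndexStart, int numCells, byte[] searchKey) {
        int low = 0, high = numCells - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            int cellOffset = block.getInt(offsetIndexStart + 4 * mid);
            int cmp = compareKeyAt(block, cellOffset, searchKey);
            if (cmp < 0) {
                low = mid + 1;
            } else if (cmp > 0) {
                high = mid - 1;
            } else {
                return cellOffset;    // offset of the matching cell
            }
        }
        return -1;                    // not found (a real seeker would return the insertion point)
    }

    // Illustrative comparator: each cell starts with a 2-byte key length, then the
    // key bytes, compared lexicographically as unsigned bytes.
    static int compareKeyAt(ByteBuffer block, int cellOffset, byte[] searchKey) {
        int keyLen = block.getShort(cellOffset) & 0xFFFF;
        int min = Math.min(keyLen, searchKey.length);
        for (int i = 0; i < min; i++) {
            int diff = (block.get(cellOffset + 2 + i) & 0xFF) - (searchKey[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return keyLen - searchKey.length;
    }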

But, the prefix trie should actually be way faster than a binary search
because you don't have to keep comparing the beginning of the key to the one
you're looking for.  I'll save the details for another email or for code
comments.  In general, with a smarter block/index format we could be much
more efficient than the current method of comparing full key byte[] over and
over again.
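
To see why, here is a toy Java sketch of a trie lookup: every byte of the
search key is consumed exactly once on the way down, whereas a binary search
re-compares the (often long, shared) key prefix at every probe. This uses
pointer-chasing Map nodes for clarity only; the proposed block trie would be a
flat encoded structure.

    import java.util.Map;
    import java.util.TreeMap;

    // Toy trie: lookup cost is O(key length), independent of how many keys share
    // a long common prefix.
    class TrieNode {
        final Map<Byte, TrieNode> children = new TreeMap<Byte, TrieNode>();
        Integer cellOffset;   // set when a cell's key ends at this node

        static Integer lookup(TrieNode root, byte[] key) {
            TrieNode node = root;
            for (byte b : key) {               // each key byte examined exactly once
                node = node.children.get(b);
                if (node == null) {
                    return null;               // no cell with this key in the block
                }
            }
            return node.cellOffset;
        }
    }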

Same random access problem also applies to the block index i think (correct
me if i'm wrong here too).



On Sat, Sep 17, 2011 at 9:57 AM, Andrew Purtell <ap...@apache.org> wrote:

> Hi Matt,
>
> > My motivation for doing this is to make hbase a viable candidate for a
>
> > large, auto-partitioned, sorted, *in-memory* database.  Not the usual
> > analytics use case, but i think hbase would be great for this.
>
>
> Really interested in hearing your thoughts as to why HBase currently is
> -- whether or not "viable" -- at least a suboptimal candidate for that
> purpose. It has been moving in the direction of being better for that
> purpose ever since 0.89. Where we can further improve would be a good
> discussion to have; as you point out, the HBase constituency is not only
> analytics use cases.
>
> Best regards,
>
>     - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>
>
> >________________________________
> >From: Matt Corgan <mc...@hotpads.com>
> >To: dev@hbase.apache.org
> >Sent: Friday, September 16, 2011 7:29 PM
> >Subject: Re: prefix compression implementation
> >
> >Ryan - thanks for the feedback.  The situation I'm thinking of where it's
> >useful to parse DirectBB without copying to heap is when you are serving
> >small random values out of the block cache.  At HotPads, we'd like to
> store
> >hundreds of GB of real estate listing data in memory so it can be quickly
> >served up at random.  We want to access many small values that are already
> >in memory, so basically skipping step 1 of 3 because values are already in
> >memory.  That being said, the DirectBB are not essential for us since we
> >haven't run into gb problems, i just figured it would be nice to support
> >them since they seem to be important to other people.
> >
> >My motivation for doing this is to make hbase a viable candidate for a
> >large, auto-partitioned, sorted, *in-memory* database.  Not the usual
> >analytics use case, but i think hbase would be great for this.
>