Posted to common-dev@hadoop.apache.org by Michael Cafarella <mi...@gmail.com> on 2006/05/15 00:00:03 UTC

HBase Design Ideas, Part I

Hi everyone,

I've written up a design that I've been working on for a little bit, for
a project I'll call "HBase".  The idea is for Hadoop to implement something
similar in spirit to BigTable.  That is, a distributed data store that
places a greater emphasis on scalability than on SQL compatibility
or traditional transactional correctness.

BigTable is neither completely described anywhere, nor is it
necessarily exactly what we want.  So I'm not trying to clone BigTable,
but I am going to draw on it a lot.

My personal view is that BigTable is a great "physical layer" but not yet
a great database system.  A major thing it lacks is a good query language.
Another, freely admitted by the Google people, is any kind of inter-row
locking.  I'm not going to try to solve all these problems, but I would
like HBase to be extendible enough that it's easy to add new query
languages or primitives.

In this mail, I'll describe a system that's pretty similar to BigTable.
I'll send a second one that describes what we might want to change
or add.

Please let me know what you think!

Thanks,
--Mike

--------------------------------------------------------------------------------
I.  Table semantics

An HBase consists of one or more HTables.  An HTable is a list of rows,
sorted alphabetically by "row name".  An HTable also has a series of
"columns."  A row may or may not contain a value for a column.  The
HTable representation is sparse, so if a row does not contain a value
for a given column, there is no storage overhead.

(Thus, there's not really a "schema" to an HTable.  Every operation, even
adding a column, is considered a row-centric operation.)

The "current version" of a row is always available, timestamped with its
last modification date.  The system may also store previous versions of a row,
according to how the HTable is configured.

Updates to a single row are always atomic and can affect one or more columns.
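
To make the single-row update semantics concrete, here is a minimal sketch of
what a client-facing interface might look like.  Every name in it (HTable,
RowUpdate, and so on) is hypothetical; the mail does not pin down an API:

// Hypothetical client-side view of an HTable; all names are illustrative only.
interface HTable {
    // Begin an atomic update of a single row, identified by its row name.
    RowUpdate startUpdate(String rowName);

    interface RowUpdate {
        // Set a value for a column; columns exist only per-row, there is no schema.
        void put(String columnName, byte[] value);
        // Remove any value this row holds for the column.
        void delete(String columnName);
        // Apply all puts and deletes atomically; the server stamps the new
        // row version with its modification timestamp.
        void commit();
        // Discard the pending changes without applying them.
        void abort();
    }
}

Whether earlier versions remain readable after commit() would depend on how the
HTable is configured, as described above.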

II.  System layout

HTables are partitionable into contiguous row regions called HRegions.
All machines in a pool run an HRegionServer.  A given HRegion is served
to clients by a single HRegionServer.  A single HRegionServer may be
responsible for many HRegions.  The HRegions for a single HTable will
be scattered across arbitrary HRegionServers.

When a client wants to add/delete/update a row value, it must locate the
relevant HRegionServer.  It then contacts the HRegionServer and communicates
the updates.  There may be other steps, mainly lock-oriented ones.  But locating
the relevant HRegionServers is a bare minimum.

The HBase system can repartition an HTable at any time.  For example, many
repeated inserts at a single location may cause a single HRegion to grow
very large.  The HBase would then try to split that into multiple HRegions.
Those HRegions may be served by the same HRegionServer as the
original or may be served by a different one.
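
As a rough illustration of when a split might be triggered (the mail does not
specify the policy), an HRegionServer could compare each HRegion's store size
against a threshold and split the row range at its middle key:

// Illustrative only: the threshold, the midKey computation, and the split
// bookkeeping are all assumed, not part of the design above.
class SplitCheck {
    static final long MAX_REGION_BYTES = 256L * 1024 * 1024;  // made-up threshold

    // Returns the two child row ranges, or null if no split is needed.
    static String[][] maybeSplit(String startKey, String midKey, String endKey,
                                 long storeBytes) {
        if (storeBytes <= MAX_REGION_BYTES) {
            return null;
        }
        // The children cover [startKey, midKey) and [midKey, endKey), so
        // together they serve exactly the rows of the original HRegion.
        return new String[][] { { startKey, midKey }, { midKey, endKey } };
    }
}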

Each HRegionServer sends a regular heartbeat to an HBaseMaster machine.
If the heartbeat for an HRegionServer fails, then the HBaseMaster is responsible
for reassigning its HRegions to other available HRegionServers.

All HRegions are stored within DFS, so the HRegion is always available, even
in the face of machine failures.  The HRegionServers and DFS DataNodes run
on the same set of machines.  We would like for an HRegionServer to always
serve data stored locally, but that is not guaranteed when using DFS.  We can
encourage it by:
1) In the event of an insert-motivated HRegion move, the new HRegionServer
should always create a new DFS file for the new HRegion.  The DFS rules of
thumb will allocate the chunks locally for the HRegionServer.
2) In the event of a machine failure, we cannot do anything similar to the above.
Instead, the HBaseMaster can ask DFS for hints as to where the relevant
file blocks are stored.  If possible, it will allocate the new HRegions to
servers that physically hold the HRegion's blocks (a sketch follows this list).
3) If necessary, we could add an API to DFS that demands block replication
to a given node.  I'd like to avoid this if possible.
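
Here is a sketch of how the HBaseMaster might combine the missed-heartbeat
check with those DFS hints when reassigning an HRegion.  The getBlockHosts()
call stands in for whatever "ask DFS for hints" turns out to be; none of these
names are an existing API:

import java.util.List;
import java.util.Map;

// Illustrative only: failure detection plus locality-aware reassignment.
class RegionReassigner {
    interface BlockLocator {
        // Hosts that physically store the blocks of the given HRegion file.
        List<String> getBlockHosts(String regionFile);
    }

    private final BlockLocator dfs;
    private final Map<String, Long> lastHeartbeat;   // server -> last heartbeat time
    private final long heartbeatTimeoutMs;

    RegionReassigner(BlockLocator dfs, Map<String, Long> lastHeartbeat,
                     long heartbeatTimeoutMs) {
        this.dfs = dfs;
        this.lastHeartbeat = lastHeartbeat;
        this.heartbeatTimeoutMs = heartbeatTimeoutMs;
    }

    // A server whose heartbeat has not been seen recently is considered dead.
    boolean isDead(String server, long now) {
        Long last = lastHeartbeat.get(server);
        return last == null || now - last > heartbeatTimeoutMs;
    }

    // Prefer a live server that already holds the HRegion's blocks locally;
    // otherwise fall back to any live server.
    String chooseNewServer(String regionFile, List<String> liveServers) {
        for (String host : dfs.getBlockHosts(regionFile)) {
            if (liveServers.contains(host)) {
                return host;
            }
        }
        return liveServers.isEmpty() ? null : liveServers.get(0);
    }
}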

The mapping from row to HRegion (and hence, to HRegionServer) is itself
stored in a special HTable.  The HBaseMaster is the only client allowed to
write to this HTable.  This special HTable may itself be split into several
HRegions.  However, we only allow a hard-coded number of split-levels.
The top level of this hierarchy must be easily stored on a single machine.
That top-level table is always served by the HBaseMaster itself.
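
A sketch of the descent through that mapping hierarchy follows.  MappingTable,
RegionLocation, and the fixed level count are assumptions made for illustration:

// Illustrative descent from the master-served top-level table down to the
// HRegion holding the target row.  All types here are hypothetical.
class RegionLookup {
    interface MappingTable {
        // The mapping (or user) HRegion whose row range covers the given row.
        RegionLocation findCovering(String rowName);
    }

    static class RegionLocation {
        final String regionServer;      // host currently serving this HRegion
        final MappingTable nextLevel;   // null once we reach a user HRegion
        RegionLocation(String regionServer, MappingTable nextLevel) {
            this.regionServer = regionServer;
            this.nextLevel = nextLevel;
        }
    }

    static final int MAX_LEVELS = 3;    // the hard-coded number of split levels

    // Step downward through the table set until the mapping for the target
    // row is found.
    static RegionLocation locate(MappingTable topLevel, String rowName) {
        RegionLocation loc = topLevel.findCovering(rowName);
        for (int level = 1; level < MAX_LEVELS && loc.nextLevel != null; level++) {
            loc = loc.nextLevel.findCovering(rowName);
        }
        return loc;
    }
}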

III.  Client behavior

Let's think about what happens when a client wants to add a row.
1) The client must compute what HRegion is responsible for the key
it wants to insert into the HTable.  It must navigate the row->HRegion
mapping, which is stored in an HTable.

So the client first contacts the HBaseMaster for the top-level table contents.
It then steps downward through the table set, until it finds the mapping for
the target row.

2) The client contacts the HRegionServer responsible for the target row,
and asks to insert.  If the HRegionServer is no longer responsible for the
relevant HRegion, it returns a failure message and tells the client to go
back to step 1 to find the new correct HRegionServer.

If the HRegionServer is the right place to go, it accepts the new row from
the client.  The HRegionServer guarantees that the insert is atomic; it
will not intermingle the insert with a competing insert for the same row key.
When the row is stored, the HRegionServer includes version and timestamp
information.

3) That's it!
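
Putting steps 1-3 together, the client-side loop might look roughly like this.
locate() is the mapping-table walk from section II, and WrongRegionException is
an invented name for the "go back to step 1" failure message:

import java.util.Map;

// Illustrative client insert loop; every name here is hypothetical.
class HClient {
    static class WrongRegionException extends Exception {}

    interface RegionServerConnection {
        // Fails if this server no longer owns the HRegion covering the row.
        void insert(String rowName, Map<String, byte[]> columns)
            throws WrongRegionException;
    }

    interface Locator {
        RegionServerConnection locate(String rowName);   // step 1
    }

    static void insertRow(Locator locator, String rowName,
                          Map<String, byte[]> columns) throws InterruptedException {
        while (true) {
            RegionServerConnection server = locator.locate(rowName);   // step 1
            try {
                server.insert(rowName, columns);                       // step 2
                return;                                                // step 3: done
            } catch (WrongRegionException stale) {
                // The HRegion moved; go back to step 1 and find the new server.
                Thread.sleep(100);
            }
        }
    }
}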

IV.  The HRegionServer

Maintaining the data for a single HRegion is slightly complicated.  It's
especially weird given the write-once semantics of DFS.  There are
three important moving parts:

1) HBackedStore is a file-backed store for rows and their values.
It is never edited in place.  It has B-Tree-like lookups for finding
a row quickly.  HBackedStore is actually a series of on-disk stores,
each store being tuned for a certain object size.  Thus, all the "small"
(in bytes) values for a row live within the same file, all the medium
ones live in a separate file, etc.  There is only one HBackedStore
for any single HRegion.

2) HUpdateLog is a log of updates to the HBackedStore.  It is backed
by an on-disk file.  When making reads from the HBackedStore, it may
be necessary to consult the HUpdateLog to see if any more-recent
updates have been made.  There may be a series of HUpdateLogs
for a single HRegion.

3) HUpdateBuf is an in-memory version of HUpdateLog.  It, too, needs
to be consulted whenever performing a read.  There is only one
HUpdateBuf for a single HRegion.
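
Since parts 2 and 3 both have to be consulted on reads, a lookup effectively
checks the three structures newest-first.  A minimal sketch, with the lookup
side of all three reduced to one hypothetical interface (deletes and old
versions are ignored here):

import java.util.List;

// Illustrative read path for one HRegion: the newest value wins.
class RegionReader {
    interface RowStore {
        byte[] get(String rowName, String columnName);   // null if not present here
    }

    private final RowStore updateBuf;          // in-memory, most recent edits
    private final List<RowStore> updateLogs;   // newest log first
    private final RowStore backedStore;        // on-disk, oldest data

    RegionReader(RowStore updateBuf, List<RowStore> updateLogs, RowStore backedStore) {
        this.updateBuf = updateBuf;
        this.updateLogs = updateLogs;
        this.backedStore = backedStore;
    }

    byte[] get(String rowName, String columnName) {
        byte[] value = updateBuf.get(rowName, columnName);        // HUpdateBuf first
        if (value != null) {
            return value;
        }
        for (RowStore log : updateLogs) {                         // then the HUpdateLogs
            value = log.get(rowName, columnName);
            if (value != null) {
                return value;
            }
        }
        return backedStore.get(rowName, columnName);              // finally the HBackedStore
    }
}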

Any incoming edit is first made directly to the HUpdateBuf.  Changes
made to the HUpdateBuf are volatile until flushed to an HUpdateLog.
The rate of flushes is an admin-configurable parameter.

Periodically, the HBackedStore and the series of current HUpdateLogs
are merged to form a new HBackedStore.  At that point, the old HUpdateLog
objects can be destroyed.  During this compaction process, incoming edits
continue to be made to the HUpdateBuf.
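
The write-side lifecycle could then be summarized roughly as below; the flush
and compaction triggers, and all the method names, are assumptions for
illustration:

import java.util.ArrayList;
import java.util.List;

// Illustrative flush/compaction cycle for one HRegion's write side.
class RegionWriteSide {
    interface UpdateBuf   { boolean isEmpty(); }
    interface UpdateLog   { }
    interface BackedStore { }

    interface Storage {
        UpdateLog flush(UpdateBuf buf);        // write the buffer out as a new HUpdateLog
        BackedStore merge(BackedStore store, List<UpdateLog> logs);
        UpdateBuf newBuf();
    }

    private final Storage storage;
    private UpdateBuf buf;                                   // the single HUpdateBuf
    private final List<UpdateLog> logs = new ArrayList<UpdateLog>();
    private BackedStore store;                               // the single HBackedStore

    RegionWriteSide(Storage storage, UpdateBuf buf, BackedStore store) {
        this.storage = storage;
        this.buf = buf;
        this.store = store;
    }

    // Runs on the admin-configured flush schedule: buffered edits become durable.
    void flush() {
        if (!buf.isEmpty()) {
            logs.add(storage.flush(buf));
            buf = storage.newBuf();            // incoming edits keep going to a fresh buffer
        }
    }

    // Periodic compaction: fold the logs into a brand-new HBackedStore.
    void compact() {
        store = storage.merge(store, logs);
        logs.clear();                          // the old HUpdateLogs can now be destroyed
    }
}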

Re: HBase Design Ideas, Part I

Posted by Stefan Groschupf <sg...@media-style.com>.
Hi,

I collect some stuff  (in a hurry)  that was the result of playing  
around with this topic so please expect just nothing.
The tests do not pass and there are several known bugs, however that  
may be give you an idea what I had worked on.

Re: HBase Design Ideas, Part I

Posted by Stefan Groschupf <sg...@media-style.com>.
> Seems like you are exporting a lot of complexity to the clients by  
> having them find the table chunks via DFS read.  Lots of data  
> motion and sync / cache issues there.  When not just ask the master  
> for the block/server of a key?  Or you could distribute this work  
> over your HRegionServers if you don't want to stress the master.   
> All this could be kept fresh in RAM there (segmented if you get  
> huge). [but this adds complexity]

I see the possibility, as described in the talk, of getting key regions
from the master, then asking the next box for a key region, etc.  So you
can distribute the key-region table over a set of boxes.
If the clients cache that, the load on the master shouldn't be that hard.
Alternatively, I see a chance of requesting an update with a given key and
getting a kind of forward message returned, until the client finds the
region server that hosts the data for the given row key.
However, since the client needs to start at the master, this would be
a lot of load.  A combination of both could be interesting.
>
> As I read you design, it sounds like you might be doing a lot of  
> seeks to find a record (do you need to scan all the logs to see if  
> a key is present?).  Best to outline the performance you want and  
> then look at the ram / disk trade-offs.  IE you can store  
> everything in BTRees, but then you will thrash you disks.  Or you  
> can store everything linearly and store all your unmerged entries  
> in RAM.  This would have different costs/benefits...
I agree there should be only one in-memory write cache
that needs to be consulted before you search the last checkpoint
file.
In any case we can also write a log to make sure we lose no data
when a box crashes.
... just my 2 cents.

Stefan

Re: HBase Design Ideas, Part I

Posted by Michael Cafarella <mi...@gmail.com>.
On 5/14/06, Eric Baldeschwieler <er...@yahoo-inc.com> wrote:
>
> Seems like you are exporting a lot of complexity to the clients by
> having them find the table chunks via DFS read.  Lots of data motion
> and sync / cache issues there.  When not just ask the master for the
> block/server of a key?  Or you could distribute this work over your
> HRegionServers if you don't want to stress the master.  All this
> could be kept fresh in RAM there (segmented if you get huge). [but
> this adds complexity]


Sorry, maybe the text was confusing.  I didn't mean for a client to
ever directly read a DFS file.  The stored files should only be served
by HRegionServers, directly to clients.  Clients might have to go through
a level of indirection to find the right HRegionServer, but that should
only happen once before being cached.

I imagine in practice that most of the key-->HRegionServer mapping
will end up being cached in memory.
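
A sketch of that client-side cache, keyed by HRegion start row; nothing here
is a committed interface, and the invalidation policy is just one plausible
choice:

import java.util.Map;
import java.util.TreeMap;

// Illustrative client-side cache of the key -> HRegionServer mapping.
class RegionLocationCache {
    // Maps an HRegion's start row to the server currently believed to host it.
    private final TreeMap<String, String> serverByStartRow =
        new TreeMap<String, String>();

    void remember(String regionStartRow, String server) {
        serverByStartRow.put(regionStartRow, server);
    }

    // A cached server whose HRegion should cover the row, or null if the
    // client has to walk the HBaseMaster's mapping hierarchy again.
    String lookup(String rowName) {
        Map.Entry<String, String> entry = serverByStartRow.floorEntry(rowName);
        return entry == null ? null : entry.getValue();
    }

    // Called when a server answers "wrong HRegion": the cached entry is stale.
    void invalidate(String rowName) {
        Map.Entry<String, String> entry = serverByStartRow.floorEntry(rowName);
        if (entry != null) {
            serverByStartRow.remove(entry.getKey());
        }
    }
}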


> As I read you design, it sounds like you might be doing a lot of
> seeks to find a record (do you need to scan all the logs to see if a
> key is present?).  Best to outline the performance you want and then
> look at the ram / disk trade-offs.  IE you can store everything in
> BTRees, but then you will thrash you disks.  Or you can store
> everything linearly and store all your unmerged entries in RAM.  This
> would have different costs/benefits...


Checking for the existence of a key should require at most one
seek.  To check for the presence of a key, you need to locate the relevant
HRegionServer, then test to see if the key is stored there.

The suggestion is to keep most data in read-only files where the
start and end keys for a page are stored in an external map.  True, you
need a seek to test for the existence of a key, but there's no way to
avoid that unless all keys are stored in memory.  Items are stored
in-order, so linear scans will be very quick.  You never edit one of
these structures in place.
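
One way to picture the "one seek" claim: keep the start keys of the read-only
pages in memory, binary-search them, and only then touch disk for the single
page that could hold the key.  The types here are illustrative, not part of
the design:

import java.util.Collections;
import java.util.List;

// Illustrative: an in-memory index of sorted page start keys means a key
// test costs at most one disk read, into the one candidate page.
class PageIndex {
    interface Page {
        boolean containsKey(String key);   // the single on-disk read ("seek")
    }

    private final List<String> pageStartKeys;  // sorted; one entry per page
    private final List<Page> pages;            // pages in the same order

    PageIndex(List<String> pageStartKeys, List<Page> pages) {
        this.pageStartKeys = pageStartKeys;
        this.pages = pages;
    }

    boolean exists(String key) {
        int i = Collections.binarySearch(pageStartKeys, key);
        if (i < 0) {
            i = -i - 2;                        // page whose start key precedes the key
        }
        return i >= 0 && pages.get(i).containsKey(key);   // at most one seek
    }
}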

Edits are kept in memory and periodically logged to disk.  Every so
often, we generate a brand-new read-only structure by merging the logs and
the existing read-only structure.

Writes are fast, being logged to memory and lazily to disk.
A read lookup takes at most one disk seek.  A scan is very efficient,
requiring no unnecessary seeks.  Read operations may become a little
complicated, as we have to integrate the read-only structure, the log
files, and the memory buffer.  However, since all of these extra operations
will be in-memory, there should be no problem performance-wise.  (One
thing to worry about is the edits that have been logged to disk already.
We should probably compact often enough so that edits can always
fit in memory comfortably.)



Re: HBase Design Ideas, Part I

Posted by Eric Baldeschwieler <er...@yahoo-inc.com>.
Seems like you are exporting a lot of complexity to the clients by  
having them find the table chunks via DFS read.  Lots of data motion  
and sync / cache issues there.  Why not just ask the master for the  
block/server of a key?  Or you could distribute this work over your  
HRegionServers if you don't want to stress the master.  All this  
could be kept fresh in RAM there (segmented if you get huge). [but  
this adds complexity]

As I read your design, it sounds like you might be doing a lot of
seeks to find a record (do you need to scan all the logs to see if a
key is present?).  Best to outline the performance you want and then
look at the RAM / disk trade-offs.  I.e., you can store everything in
B-Trees, but then you will thrash your disks.  Or you can store
everything linearly and store all your unmerged entries in RAM.  This
would have different costs/benefits...




Re: HBase Design Ideas, Part I

Posted by Stefan Groschupf <sg...@media-style.com>.
Hi,
sorry for the delay in responding...

>> I already posted a mail about this issue.
>> What we may be need is a Writer that can seek first for row key and
>> than for column keys.
>> In general I agree with  sparse structure.
>
>
> What about this: don't store explicit "column" fields anywhere.   
> Rather,
> each row is stored as a series of key-value pairs, where the key is  
> the
> column name.
I didn't get this.
How do you then want to associate a key-value pair (let's call it a
cell) with a row key?
As mentioned, I see either an object "rowKey - columnName - value" or one
rowKey with a columnKey-value array.
>
> True, if there are a huge number of columns and you are interested in
> just one, there will be unnecessary processing.  This is especially  
> bad
> if one column is a 2-char string and another column is a video file.
>
> So we should actually keep a family of files, segmented by object  
> size.
> But in the general case, it shouldn't be possible to "seek to a  
> column".
> Instead, you seek to a row and unpack all its key/val (col/cell)  
> pairs.
Hmm, I'm not sure I like the idea of having the columns separated into
files by size.
I don't think there are many use cases where people will store, let's
say, locales and video files associated with the same URL row key.
In such a case it makes more sense to have separate tables.
From my point of view the best way would be to have a kind of column-seek
mechanism, which will require a different kind of sequence writer and
reader.
As far as I remember, the Google system has all columns of a row in one
tablet.
What do you think about being able to have one row in different tablets,
where each tablet holds different columns?
So we would distribute not just the rows but also the columns.

>
>
>> My idea was to have the lock on the HRegionServer level, my ideas was
>> that the client itself take care about replication,
>> means write the value to n servers that have the same replicatins of
>> HRegions.
>
>
> Do you mean that a lock applies to an entire server at once?  Or
> that an HRegionServer is responsible for all locks?  (I'd like to do
> the latter, at least in the short-term.)
Yes, the latter is better from my point of view.
>
> I'd like to avoid having an HRegion that's hosted by multiple servers,
> because then it's unclear which HRegionServer should own the lock.
> I suppose the HRegionServers for a given HRegion could hold an
> election, but this seems like a lot of work.
>
> If there's a row that's really "hot" and wanted by a lot of  
> clients, I could
> imagine starting a series of "read-only" HRegionServers that field  
> read
> requests.  That way you avoid having an election for the lock but can
> still scale capacity if necessary.
That is a good idea.
>>
>> > The HBase system can repartition an HTable at any time.  For
>> > example, many
>> > repeated inserts at a single location may cause a single HRegion to
>> > grow
>> > very large.  The HBase would then try to split that into multiple
>> > HRegions.
>> > Those HRegions may be served by the same HRegionServer as the
>> > original or may be served by a different one.
>> Would the node send out a message to request a split or does the
>> master decide based on heart beat messages?
>
>
> There are two ways that an HRegionServer might offer brand-new
> service for an HRegion:
> 1) The HRegion's old HRegionServer died.  A new HRegionServer
> offers the exact same HRegion, loaded from a DFS file.  This will
> have to be initiated by the HBaseMaster, because it is the only  
> node that
> knows about heartbeats.
Make sense.
>
> 2) An HRegion is getting too big, and must be split into two.  I
> imagine that this can be initiated by the local HRegionServer,
> which then asks the master for various hints (like where there
> is another lightly-loaded HRegionServer that could take a new
> Region).
Maybe the local region server should just request to be split and the
master should handle the split itself.
My concern is that just using heartbeats to announce regions to the
master is not fast enough.
That means that while a region is being split, all its rows need to be
read-only during the process.  The master needs to know the two new
regions before we remove the write lock.

>
>> My idea was to simply download the data to the node and read any time
>> locally, but write into the dfs, since in my case write access can be
>> slower but I needer very fast read access.
>
>
> You mean just keep a local cache of the DFS file?  That might be
> a good idea for a feature we add into DFS as a performance  
> enhancement.

Yes, reading files from DFS is too slow;
we ran into the same performance problem too often in several
projects.

For example, reading a Lucene index file from DFS - as Nutch does - is
just useless.  But loading a copy onto the local HDD during startup is
fast enough.
In general I don't think disk space is an issue these days, so I have
no problem with having the data replicated in DFS and on a local HDD.



Re: HBase Design Ideas, Part I

Posted by Michael Cafarella <mi...@gmail.com>.
Hi Stefan,

Thanks for your mail.  Comments below.

On 5/15/06, Stefan Groschupf <sg...@media-style.com> wrote:
>
> I was playing around with row and run in to several problems using
> the hadoop io package. (SequenceReader writer)
> Optimal would be if a cell is a writable but having rowkey and cell
> key and value for each sell blows up disk usage.
> Alternative we can have a row writable so we only one rowkey , n
> column key and n values.
> In case a row has many column this scales very bad. For example my
> row key is a url and my column keys are user ids and the value are
> number of clicks.
> if I want to get the number of clicks for a given url and user, I
> need to load the values for all other user as well. :(
>
> I already posted a mail about this issue.
> What we may be need is a Writer that can seek first for row key and
> than for column keys.
> In general I agree with  sparse structure.


What about this: don't store explicit "column" fields anywhere.  Rather,
each row is stored as a series of key-value pairs, where the key is the
column name.

True, if there are a huge number of columns and you are interested in
just one, there will be unnecessary processing.  This is especially bad
if one column is a 2-char string and another column is a video file.

So we should actually keep a family of files, segmented by object size.
But in the general case, it shouldn't be possible to "seek to a column".
Instead, you seek to a row and unpack all its key/val (col/cell) pairs.
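
To make that concrete: a row is just its key plus a sorted set of
(column name, value) cells, and each cell is routed to a small/medium/large
store file by value size.  The cutoffs below are made up purely for
illustration:

import java.util.TreeMap;

// Illustrative sparse row: columns exist only as per-row key/value pairs.
class SparseRow {
    final String rowKey;
    // Column name -> value; a column the row doesn't have costs nothing.
    final TreeMap<String, byte[]> cells = new TreeMap<String, byte[]>();

    SparseRow(String rowKey) {
        this.rowKey = rowKey;
    }

    // Route a cell to one of the size-segmented store files.
    static String storeFileFor(byte[] value) {
        if (value.length < 1024)        return "small";    // e.g. a 2-char string
        if (value.length < 1024 * 1024) return "medium";
        return "large";                                     // e.g. a video file
    }
}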


> My idea was to have the lock on the HRegionServer level, my ideas was
> that the client itself take care about replication,
> means write the value to n servers that have the same replicatins of
> HRegions.


Do you mean that a lock applies to an entire server at once?  Or
that an HRegionServer is responsible for all locks?  (I'd like to do
the latter, at least in the short-term.)

I'd like to avoid having an HRegion that's hosted by multiple servers,
because then it's unclear which HRegionServer should own the lock.
I suppose the HRegionServers for a given HRegion could hold an
election, but this seems like a lot of work.

If there's a row that's really "hot" and wanted by a lot of clients, I could
imagine starting a series of "read-only" HRegionServers that field read
requests.  That way you avoid having an election for the lock but can
still scale capacity if necessary.

(I don't think we'll ever have a situation where a flood of writes come in
the door.  If so, the whole design is a bad idea!)


>
> > The HBase system can repartition an HTable at any time.  For
> > example, many
> > repeated inserts at a single location may cause a single HRegion to
> > grow
> > very large.  The HBase would then try to split that into multiple
> > HRegions.
> > Those HRegions may be served by the same HRegionServer as the
> > original or may be served by a different one.
> Would the node send out a message to request a split or does the
> master decide based on heart beat messages?


There are two ways that an HRegionServer might offer brand-new
service for an HRegion:
1) The HRegion's old HRegionServer died.  A new HRegionServer
offers the exact same HRegion, loaded from a DFS file.  This will
have to be initiated by the HBaseMaster, because it is the only node that
knows about heartbeats.

2) An HRegion is getting too big, and must be split into two.  I
imagine that this can be initiated by the local HRegionServer,
which then asks the master for various hints (like where there
is another lightly-loaded HRegionServer that could take a new
Region).


> My idea was to simply download the data to the node and read any time
> locally, but write into the dfs, since in my case write access can be
> slower but I needer very fast read access.


You mean just keep a local cache of the DFS file?  That might be
a good idea for a feature we add into DFS as a performance enhancement.


> My idea was in such a case the HRegionServer may be know the new
> location at least until the master is informed.
> So getting a forward message could be faster than get an error and
> try ask for the target again.


The old HRegionServer may not know the new location, depending
on how the new one was created.  (The old one might be dead, too!)
But if we can speed things up substantially by forwarding the location,
I think that's OK.

Re: HBase Design Ideas, Part I

Posted by Stefan Groschupf <sg...@media-style.com>.
Hi,
sounds pretty much similar to what I was thinking about,
just that I had used different terms, and your description is much more
elegant than my hand-written notes.
Some comments below.

> ---------------------------------------------------------------------- 
> ----------
> I.  Table semantics
>
> An HBase consists of one or more HTables.  An HTable is a list of  
> rows,
> sorted alphabetically by "row name".  An HTable also has a series of
> "columns."  A row may or may not contain a value for a column.  The
> HTable representation is sparse, so if a row does not contain a value
> for a given column, there is no storage overhead.
>
> (Thus, there's not really a "schema" to an HTable.  Every  
> operation, even
> adding a column, is considered a row-centric operation.)
>
> The "current version" of a row is always available, timestamped  
> with its
> last modification date.  The system may also store previous  
> versions of a row,
> according to how the HTable is configured.
I was playing around with rows and ran into several problems using
the Hadoop io package (SequenceFile reader/writer).
Optimally a cell would be a Writable, but having a row key, cell
key, and value for each cell blows up disk usage.
Alternatively we can have a row Writable, so we have only one row key, n
column keys, and n values.
In the case where a row has many columns this scales very badly.  For
example, my row key is a URL, my column keys are user ids, and the
values are numbers of clicks.
If I want to get the number of clicks for a given URL and user, I
need to load the values for all the other users as well. :(

I already posted a mail about this issue.
What we may need is a writer that can seek first by row key and
then by column key.
In general I agree with the sparse structure.


>
> Updates to a single row are always atomic and can affect one or  
> more columns.
>
> II.  System layout
>
> HTables are partitionable into contiguous row regions called HRegions.
> All machines in a pool run an HRegionServer.  A given HRegion is  
> served
> to clients by a single HRegionServer.  A single HRegionServer may be
> responsible for many HRegions.  The HRegions for a single HTable will
> be scattered across arbitrary HRegionServers.
>
> When a client wants to add/delete/update a row value, it must  
> locate the
> relevant HRegionServer.  It then contacts the HRegionServer and  
> communicates
> the updates.  There may be other steps, mainly lock-oriented ones.   
> But locating
> the relevant HRegionServers is a bare minimum.

My idea was to have the lock on the HRegionServer level; my idea was
that the client itself takes care of replication,
meaning it writes the value to the n servers that hold the same replicas
of the HRegion.

>
> The HBase system can repartition an HTable at any time.  For  
> example, many
> repeated inserts at a single location may cause a single HRegion to  
> grow
> very large.  The HBase would then try to split that into multiple  
> HRegions.
> Those HRegions may be served by the same HRegionServer as the
> original or may be served by a different one.
Would the node send out a message to request a split, or does the
master decide based on heartbeat messages?
>
> Each HRegionServer sends a regular heartbeat to an HBaseMaster  
> machine.
> If the heartbeat for an HRegionServer fails, then the HBaseMaster  
> is responsible
> for reassigning its HRegions to other available HRegionServers.
>
> All HRegions are stored within DFS, so the HRegion is always  
> available, even
> in the face of machine failures.  The HRegionServers and DFS  
> DataNodes run
> on the same set of machines.  We would like for an HRegionServer to  
> always
> serve data stored locally, but that is not guaranteed when using  
> DFS.  We can
> encourage it by:
> 1) In the event of an insert-motivated HRegion move, the new  
> HRegionServer
> should always create a new DFS file for the new HRegion.  The DFS  
> rules of
> thumb will allocate the chunks locally for the HRegionServer.
> 2) In the even of a machine failure, we cannot do anything similar  
> to above.
> Instead, the HBaseMaster can ask DFS for hints as to where the  
> relevant
> file blocks are stored.  If possible, it will allocate the new
> HRegions to servers
> that physically contain the HRegion.
> 3) If necessary, we could add an API to DFS that demands block  
> replication
> to a given node.  I'd like to avoid this if possible.
My idea was to simply download the data to the node and always read it
locally, but write into the DFS, since in my case write access can be
slower but I need very fast read access.

>
> The mapping from row to HRegion (and hence, to HRegionServer) is  
> itself
> stored in a special HTable.  The HBaseMaster is the only client  
> allowed to
> write to this HTable.  This special HTable may itself be split into  
> several
> HRegions.  However, we only allow a hard-coded number of split-levels.
> The top level of this hierarchy must be easily-stored on a single  
> machine.
> That top-level table is always served by the HBaseMaster itself.
>
> III.  Client behavior
>
> Let's think about what happens when a client wants to add a row.
> 1) The client must compute what HRegion is responsible for the key
> it wants to insert into the HTable.  It must navigate the row->HRegion
> mapping, which is stored in an HTable.
>
> So the client first contacts the HBaseMaster for the top-level  
> table contents.
> It then steps downward through the table set, until it finds the  
> mapping for
> the target row.
>
> 2) The client contacts the HRegionServer responsible for the target  
> row,
> and asks to insert.  If the HRegionServer is no longer responsible  
> for the
> relevant HRegion, it returns a failure message and tells the client  
> to go
> back to step 1 to find the new correct HRegionServer.
My idea was that in such a case the HRegionServer might know the new
location, at least until the master is informed.
So getting a forward message could be faster than getting an error and
asking for the target again.
>
> If the HRegionServer is the right place to go, it accepts the new  
> row from
> the client.  The HRegionServer guarantees that the insert is  
> atomic; it
> will not intermingle the insert with a competing insert for the  
> same row key.
> When the row is stored, the HRegionServer includes version and  
> timestamp
> information.
>
> 3) That's it!
>
> IV The HRegionServer
>
> Maintaining the data for a single HRegion is slightly complicated.   
> It's
> especially weird given the write-once semantics of DFS.  There are
> three important moving parts:
>
> 1) HBackedStore is a file-backed store for rows and their values.
> It is never edited in place.  It has B-Tree-like lookups for finding
> a row quickly.  HBackedStore is actually a series of on-disk stores,
> each store being tuned for a certain object size.  Thus, all the  
> "small"
> (in bytes) values for a row live within the same file, all the medium
> ones live in a separate file, etc.  There is only one HBackedStore
> for any single HRegion.
>
> 2) HUpdateLog is a log of updates to the HBackedStore.  It is backed
> by an on-disk file.  When making reads from the HBackedStore, it may
> be necessary to consult the HUpdateLog to see if any more-recent
> updates have been made.  There may be a series of HUpdateLogs
> for a single HRegion.
>
> 3) HUpdateBuf is an in-memory version of HUpdateLog.  It, too, needs
> to be consulted whenever performing a read.  There is only one
> HUpdateBuf for a single HRegion.
>
> Any incoming edit is first made directly to the HUpdateBuf.  Changes
> made to the HUpdateBuf are volatile until flushed to an HUpdateLog.
> The rate of flushes is an admin-configurable parameter.
>
> Periodically, the HBackedStore and the series of current HUpdateLogs
> are merged to form a new HBackedStore.  At that point, the old  
> HUpdateLog
> objects can be destroyed.  During this compaction process, edits are
> made to the HUpdateBuf.

Sounds great!
Looking forward to seeing that working.
Stefan