Posted to user@hbase.apache.org by Billy Pearson <sa...@pearsonwholesale.com> on 2009/06/10 03:35:51 UTC

Re: for one specific row: are the values of all columns of one family stored in one physical/grid node?

You should read over the
http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture

The data is sorted by row key, then column:label, then timestamp, in that
order. So if you have row key1, all the labels for columnval1 will be stored
together in the same file.
We do flush more than one file to disk as data is added, so the values are
not always stored together in a single file until a major compaction merges
all the store files together.
But what we mean by "stored together" is that all of column1 will be stored
in one set of files and column2 would be stored in a separate set of files.
So if you only want data from column1, you only need to read the data from
one set of files, not all the columns for that row key.
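
As a rough illustration of that sort order and of keeping each column in its
own set of files, here is a small Python sketch. It is not HBase code: the
cell tuples and the names sort_key and group_by_family are made up for this
example, and sorting newer timestamps first is an assumption of the sketch.

```python
from collections import defaultdict

# Hypothetical cells: (row key, "column:label", timestamp, value).
cells = [
    ("key2", "column1:a", 5, "v5"),
    ("key1", "column2:x", 3, "v3"),
    ("key1", "column1:b", 1, "v1"),
    ("key1", "column1:a", 2, "v2"),
    ("key1", "column1:a", 4, "v4"),
]

def sort_key(cell):
    row, column, timestamp, _value = cell
    # Row key first, then column:label, then timestamp.
    # Newest-first timestamps are an assumption of this sketch.
    return (row, column, -timestamp)

def group_by_family(cells):
    """Split cells into one sorted list per column (the part before ':')."""
    stores = defaultdict(list)
    for cell in sorted(cells, key=sort_key):
        family = cell[1].split(":", 1)[0]
        stores[family].append(cell)
    return dict(stores)

stores = group_by_family(cells)
```

Reading only column1 then means touching stores["column1"] and nothing else.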

Also, the columns for key1 will not be spread over all the nodes but will be
on just one node in the cluster. The table is split by key values, so keys
1-100 would be one region and keys 101-200 would be another region, all in
the same table.
When a region gets too large, it splits and becomes two regions, and so on.
So when we look up a key, we only have to look at one server.
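
A minimal sketch of that lookup, assuming each region is identified by its
start key and that row keys compare as plain strings (hence the zero-padded
keys). The region table, the server names, and the locate function are
hypothetical and are not an HBase API.

```python
import bisect

# Hypothetical region table: sorted start keys and the server hosting each region.
# The first region starts at the empty key so every row key falls in some region.
region_start_keys = ["", "key101", "key201"]
region_servers = ["server-a", "server-b", "server-a"]

def locate(row_key):
    """Return (region index, server) for the region whose key range holds row_key."""
    # The owning region is the last one whose start key is <= row_key.
    index = bisect.bisect_right(region_start_keys, row_key) - 1
    return index, region_servers[index]
```

For example, locate("key042") picks the first region and locate("key150") the
second; in both cases only one server has to be asked.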

Billy



"Ric Wang" <wq...@gmail.com> wrote in 
message news:21224f560906091155i6bb9b6e1xc59095a01bbc2d50@mail.gmail.com...
> Hi,
>
> Very new to Hadoop and HBase. And sorry about the rudimentary question:
>
> I store my artifacts as rows in an HBase table, and the attributes of each
> artifact as labels within one single column family (e.g. myFamily). I may
> have tens of thousands of labels, and millions and millions of rows. Now, as
> the data size grows, some documentation says that the values of one family
> will be "stored together". I wonder what that really means.
>
> For example, for a given row key (my.key.123), will HBase guarantee that ALL
> its attributes (i.e. the values of ALL the labels in "myFamily") are stored
> on one physical/grid node? In other words, if I want to find the ONE
> matching row key "my.key.123" based on its attributes (column values), at
> the implementation level, will HBase be
>
> 1. traversing all the distributed nodes and interrogating the column values;
> aggregating the results coming from all the nodes; and finally finding out
> the matching row key
>
> or
>
> 2. doing atomic operations in parallel on each node locally; and finally,
> only one node will return the matching row key (if there is a match).
>
> My guess is that the answer depends on whether all attributes (in myFamily)
> of a given row are stored on one and only one node.
>
> Hope I didn't make my question very confusing. Very new to column-based
> databases; please help and bear with me.
>
> Thanks!
> Ric
>

RE: for one specific row: are the values of all columns of one family stored in one physical/grid node?

Posted by "Jim Kellerman (POWERSET)" <Ji...@microsoft.com>.

To expand on Erik's explanation:

A table is made up of one or more regions.

Each region contains all the data for all the rows between its start and end keys.

Each region owns multiple stores, one per column family.

Each region is served from one region server (but regions can migrate from one
region server to another due to region server death, load balancing, etc.)

Based on the row key, the client can determine which region server to talk to. The
client can then fetch from any of the column families for that row by talking to
that one region server.
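
The hierarchy above (table, regions, one store per column family) can be
sketched in a few lines of Python. This is only a toy model: the Region and
Table classes, the server names, and the convention that an empty end key
means "no upper bound" are assumptions of the sketch, not the HBase client API.

```python
from dataclasses import dataclass, field

@dataclass
class Region:
    start_key: str  # inclusive
    end_key: str    # exclusive; "" means no upper bound (assumed convention)
    server: str     # the one region server currently serving this region
    # One store per column family: family -> {(row, label): value}
    stores: dict = field(default_factory=dict)

    def contains(self, row):
        return self.start_key <= row and (self.end_key == "" or row < self.end_key)

    def put(self, row, family, label, value):
        self.stores.setdefault(family, {})[(row, label)] = value

    def get(self, row, family):
        store = self.stores.get(family, {})
        return {label: v for (r, label), v in store.items() if r == row}

@dataclass
class Table:
    regions: list

    def region_for(self, row):
        """Find the single region (and hence server) responsible for a row key."""
        for region in self.regions:
            if region.contains(row):
                return region
        raise KeyError(row)

table = Table([Region("", "m", "rs1"), Region("m", "", "rs2")])
region = table.region_for("key1")
region.put("key1", "familyA", "labelX", "a")
region.put("key1", "familyB", "labelY", "b")
```

Both families of "key1" end up in stores of the same Region object, so both
are read through the same server.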

If you haven't read the Bigtable paper (http://labs.google.com/papers/bigtable.html)
it is highly recommended that you do, because the goal from the start of the HBase
project has been to produce something that is as close to Bigtable as possible
(especially from the client point of view), in an open source project so that there
is no vendor lock-in.

---
Jim Kellerman, Powerset (Live Search, Microsoft Corporation)


> -----Original Message-----
> From: Erik Holstad [mailto:erikholstad@gmail.com]
> Sent: Thursday, June 11, 2009 12:51 AM
> To: hbase-user@hadoop.apache.org
> Subject: Re: for one specific row: are the values of all columns of
> one family stored in one physical/grid node?
>
> Hi!
> Just to be clear, what is being said here is that every region contains a
> set of stores, each of which holds one family for that specific row range.
> And one store can hold many files with data for that store, which a major
> compaction turns into one single file.
>
> Erik

Re: for one specific row: are the values of all columns of one family stored in one physical/grid node?

Posted by Erik Holstad <er...@gmail.com>.

Hi!
Just to be clear, what is being said here is that every region contains a
set of stores, each of which holds one family for that specific row range.
And one store can hold many files with data for that store, which a major
compaction turns into one single file.
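
A small Python sketch of that last point, assuming each store file is already
sorted. The sample files and the major_compact function are hypothetical, and
the sketch ignores versions and deletes, which a real compaction also has to
deal with.

```python
import heapq

# Hypothetical store files for one family in one region; each is already sorted.
store_files = [
    [("key1", "a", "old"), ("key3", "a", "x")],
    [("key2", "a", "y")],
    [("key1", "b", "z"), ("key4", "a", "w")],
]

def major_compact(files):
    """Merge several sorted store files into a single sorted file."""
    return [list(heapq.merge(*files))]

compacted = major_compact(store_files)
```

After the call the store holds one file instead of three, with the entries
still in sorted order.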

Erik

Re: for one specific row: are the values of all columns of one family stored in one physical/grid node?

Posted by Billy Pearson <sa...@pearsonwholesale.com>.

All the columns for any row key will be stored in one region, hosted by one
server. The regions are split by row key, not by columns.

So all the columns for rowx will be in only one region on one server.

A table is made up of regions: one to start with, and as more rows are added
the regions split by row. Each region holds a range of the rows and all the
columns for its row key range.
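
To make the splitting concrete, here is a toy Python sketch. The add_row
function and the max_rows limit are made up for illustration; max_rows stands
in for the real size threshold, which is not modelled here.

```python
import bisect

def add_row(regions, row, max_rows=4):
    """Add a row key to the region covering it; split that region if it grows too big.

    regions is a list of sorted row-key lists, ordered by key range.
    """
    # Pick the first region whose last key is >= row, else the last region.
    index = len(regions) - 1
    for i, rows in enumerate(regions):
        if rows and row <= rows[-1]:
            index = i
            break
    rows = regions[index]
    if row not in rows:
        bisect.insort(rows, row)
    if len(rows) > max_rows:
        middle = len(rows) // 2
        # Replace the one region with two halves, split by row key.
        regions[index:index + 1] = [rows[:middle], rows[middle:]]
    return regions

regions = [[]]  # a table starts out as a single region
for n in range(1, 9):
    add_row(regions, f"key{n:03d}")
```

Every row key lives in exactly one of the resulting regions, and a region is
only ever cut between rows, never between the columns of a row.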

Billy



"Ric Wang" <wq...@gmail.com> wrote in 
message news:21224f560906100907m71ab7671u3f299ecedc6380bb@mail.gmail.com...
> Billy,
>
> By saying "columns for key1 will not be on all the nodes but just one node
> in the cluster", you really mean "columns of the SAME family for key1...",
> right?
>
> Please correct me if I am wrong, but I think for the row key "key1", the
> data value of "familyA:labelX" and that of "familyB:labelY" can still be
> stored on two different nodes because they are in two different families.
> Is that correct?
>
> Thanks in advance for your clarification.
> -Ric

Re: for one specific row: are the values of all columns of one family stored in one physical/grid node?

Posted by Ric Wang <wq...@gmail.com>.

Billy,

By saying "columns for key1 will not be on all the nodes but just one node
in the cluster", you really mean "columns of the SAME family for key1...",
right?

Please correct me if I am wrong, but I think for the row key "key1", the
data value of "familyA:labelX" and that of "familyB:labelY" can still be
stored on two different nodes because they are in two different families. Is
that correct?

Thanks in advance for your clarification.
-Ric



-- 
Ric Wang
wqt.work@gmail.com