Posted to user@hbase.apache.org by Praveen Sripati <pr...@gmail.com> on 2012/01/21 08:08:01 UTC

Disk Seeks and Column families

Hi,

1) According to this url (1), HBase performs well for two or three
column families. Why is that so?

2) A dump of an HFile looks like below. The contents of a row stay together,
like in a regular row-oriented database. If the column family has 100 column
qualifiers and is dense, then the data for a particular qualifier is spread
wide. If I want to do an aggregation on a particular qualifier, the disk seeks
don't seem to be much better than in a regular row-oriented database.

Please correct me if I am wrong.

K: row-550/colfam1:50/1309813948188/Put/vlen=2 V: 50
K: row-550/colfam1:50/1309812287166/Put/vlen=2 V: 50
K: row-551/colfam1:51/1309813948222/Put/vlen=2 V: 51
K: row-551/colfam1:51/1309812287200/Put/vlen=2 V: 51
K: row-552/colfam1:52/1309813948256/Put/vlen=2 V: 52

(1) - http://hbase.apache.org/book/number.of.cfs.html

Thanks,
Praveen

Re: Disk Seeks and Column families

Posted by Andrey Stepachev <oc...@gmail.com>.
2012/1/24 Andrey Stepachev <oc...@gmail.com>:
> 2012/1/24 Praveen Sripati <pr...@gmail.com>:
>
> a) As in 1), add something to key. For example each 5 minutes. Later your
> can issue 16 queries and merge them (for realtime)

Yeah... 3 minutes :)

-- 
Andrey.

Re: Disk Seeks and Column families

Posted by Andrey Stepachev <oc...@gmail.com>.
2012/1/24 Praveen Sripati <pr...@gmail.com>:
> Thanks for the response. I am just getting started with HBase. And before
> getting into the code/api level details, I am trying to understand the
> problem area HBase is trying to address through it's architecture/design.
>
> 1) So, what are the recommendations for having many columns and with dense
> data? Is HBase not the right tool?

Split them by prefixing keys (i.e. key -> a, b, c => a_key, b_key, c_key)
and aggregate them as independent values (if possible).
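
To make the prefixing concrete, here is a rough sketch with the plain Java
client (table, family and qualifier names are made up for illustration, and
I'm assuming the standard 0.92-era HTable/Put API):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PrefixedKeys {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "metrics");
    String key = "row-550";
    // Instead of one wide row with attributes a, b and c as qualifiers,
    // write three narrow rows keyed a_row-550, b_row-550 and c_row-550,
    // each of which can be read and aggregated independently.
    for (String attr : new String[] { "a", "b", "c" }) {
      Put put = new Put(Bytes.toBytes(attr + "_" + key));
      put.add(Bytes.toBytes("colfam1"), Bytes.toBytes("v"), Bytes.toBytes("50"));
      table.put(put);
    }
    table.close();
  }
}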

>
> 2) Also, if the data for a column is spread wide across blocks and maybe
> even across nodes how will HBase help in aggregation?

Think about and optimize your data layout for your tasks. HBase is not an RDBMS.
You should plan your schema in a way that suits your queries best.

>
> 3) Also, about storing data using an incremental row key, initially there
> will be a hot stop with the data getting to a single region. Even after a
> split of the region into two, the first one won't be getting any data (in
> incremental row key) and the second one will be hammered.

a) As in 1), add something to the key, for example a bucket for each 5 minutes.
Later you can issue 16 queries and merge them (for realtime). (See the sketch
after this list.)
b) If this data is for MapReduce, you can use key = day + md5(time); later the
MR task collects all the data in the right place for aggregation
(as usual, you must trade off write speed against query speed).
c) Split your incoming data by some other field, for example host or metric.
You can look at the data model of http://opentsdb.net/
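
To make option (a) concrete, here is a minimal sketch of one way to read it
(the bucket count of 16, the 13-digit timestamp padding and the key format
are my own assumptions, not something from the thread):

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedTimeKeys {
  static final int BUCKETS = 16;

  // Write path: spread consecutive timestamps over BUCKETS key prefixes
  // so inserts do not all hammer the last region.
  static byte[] rowKey(long ts) {
    int salt = (int) (ts % BUCKETS);
    return Bytes.toBytes(String.format("%02d_%013d", salt, ts));
  }

  // Read path: one Scan per salt over the same time range; the client
  // merges the 16 result streams back into time order.
  static List<Scan> scansFor(long startTs, long stopTs) {
    List<Scan> scans = new ArrayList<Scan>();
    for (int salt = 0; salt < BUCKETS; salt++) {
      Scan s = new Scan();
      s.setStartRow(Bytes.toBytes(String.format("%02d_%013d", salt, startTs)));
      s.setStopRow(Bytes.toBytes(String.format("%02d_%013d", salt, stopTs)));
      scans.add(s);
    }
    return scans;
  }
}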

>
> One of the approach to alleviate this is not to insert incremental row keys
> from the client and have the row keys scattered for better load balancing.
> But, this approach is not efficient if I want to get events in a time
> sequence, in which case I have to use some filters to scan the entire data.
>
> 4) Still not clear why I can't have 10 column families in HBase and why
> only 2 or 3 of them according to this link (1)?

You can.
But:
a) you should tune a bunch of parameters
(hbase.hregion.memstore.block.multiplier,
hbase.hstore.blockingStoreFiles and others)
to get it working under a high write load. And according to the architecture
of memstores and splits, fewer families perform better.
b) you can write a small benchmark and see that 2 families are significantly
faster than 10.
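
A small sketch of both points (the family names are only illustrative, and the
two properties below are the region server settings Andrey names, normally set
in hbase-site.xml rather than in client code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateTableWithFewFamilies {
  public static void main(String[] args) throws Exception {
    // Write-path parameters to tune on the region servers (hbase-site.xml):
    //   hbase.hregion.memstore.block.multiplier
    //   hbase.hstore.blockingStoreFiles
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    // Prefer a small number of families, grouping columns that are
    // written and read together.
    HTableDescriptor desc = new HTableDescriptor("events");
    desc.addFamily(new HColumnDescriptor("meta"));
    desc.addFamily(new HColumnDescriptor("data"));
    admin.createTable(desc);
  }
}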


>
> (1) - http://hbase.apache.org/book/number.of.cfs.html
>
> Praveen
>
> On Sun, Jan 22, 2012 at 12:02 PM, M. C. Srivas <mc...@gmail.com> wrote:
>
>> Praveen,
>>
>>  basically you are correct on all counts. If there are too many columns,
>>  HBase will have to issue more disk-seeks  to extract only the particular
>> columns you need ... and since the data is laid out horizontally there are
>> fewer common substrings in a single HBase-block and compression quality
>> starts to degrade due to reduced redundancy.
>>
>>
>> On Sat, Jan 21, 2012 at 9:49 AM, Praveen Sripati
>> <pr...@gmail.com>wrote:
>>
>> > Thanks for the response.
>> >
>> > > The contents of a row stay together like a regular row-oriented
>> database.
>> >
>> > > K: row-550/colfam1:50/1309813948188/Put/vlen=2 V: 50
>> > > K: row-550/colfam1:50/1309812287166/Put/vlen=2 V: 50
>> > > K: row-551/colfam1:51/1309813948222/Put/vlen=2 V: 51
>> > > K: row-551/colfam1:51/1309812287200/Put/vlen=2 V: 51
>> > > K: row-552/colfam1:52/1309813948256/Put/vlen=2 V: 52
>> >
>> > Is the above statement true for a HFile?
>> >
>> > Also from the above example, the data for the column family qualifier are
>> > not adjacent to take advantage of compression (
>> > http://en.wikipedia.org/wiki/Column-oriented_DBMS#Compression). Is this
>> a
>> > proper statement?
>> >
>> > Regards,
>> > Praveen
>> >
>> > On Sat, Jan 21, 2012 at 9:03 PM, <yu...@gmail.com> wrote:
>> >
>> > > Have you considered using AggregationProtocol to perform aggregation ?
>> > >
>> > > Thanks
>> > >
>> > >
>> > >
>> > > On Jan 20, 2012, at 11:08 PM, Praveen Sripati <
>> praveensripati@gmail.com>
>> > > wrote:
>> > >
>> > > > Hi,
>> > > >
>> > > > 1) According to the this url (1), HBase performs well for two or
>> three
>> > > > column families. Why is it so?
>> > > >
>> > > > 2) Dump of a HFile, looks like below. The contents of a row stay
>> > together
>> > > > like a regular row-oriented database. If the column family has 100
>> > column
>> > > > family qualifiers and is dense then the data for a particular column
>> > > family
>> > > > qualifier is spread wide. If I want to do an aggregation on a
>> > particular
>> > > > column identifier, the disk seeks doesn't seems to be much better
>> than
>> > a
>> > > > regular row-oriented database.
>> > > >
>> > > > Please correct me if I am wrong.
>> > > >
>> > > > K: row-550/colfam1:50/1309813948188/Put/vlen=2 V: 50
>> > > > K: row-550/colfam1:50/1309812287166/Put/vlen=2 V: 50
>> > > > K: row-551/colfam1:51/1309813948222/Put/vlen=2 V: 51
>> > > > K: row-551/colfam1:51/1309812287200/Put/vlen=2 V: 51
>> > > > K: row-552/colfam1:52/1309813948256/Put/vlen=2 V: 52
>> > > >
>> > > > (1) - http://hbase.apache.org/book/number.of.cfs.html
>> > > >
>> > > > Thanks,
>> > > > Praveen
>> > >
>> >
>>



-- 
Andrey.

Re: Disk Seeks and Column families

Posted by Jason Frantz <jf...@maprtech.com>.
On Tue, Jan 24, 2012 at 11:45 AM, Praveen Sripati
<pr...@gmail.com>wrote:

> Thanks for the response. I am just getting started with HBase. And before
> getting into the code/api level details, I am trying to understand the
> problem area HBase is trying to address through it's architecture/design.
>
> 1) So, what are the recommendations for having many columns and with dense
> data? Is HBase not the right tool?
>

HBase's data model works great if your set of columns can be split into
separate column families that are only accessed together. If you often
randomly access individual columns, then it might make sense to put your
column qualifiers inside your key.
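
For example, a hypothetical layout (names made up) that moves the qualifier
into the row key, so a random read of one value touches a single narrow row:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class QualifierInKey {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");
    // Instead of Get("row-550") plus addColumn("colfam1", "50"),
    // store one value per row under the composite key "row-550/50".
    Get get = new Get(Bytes.toBytes("row-550/50"));
    Result result = table.get(get);
    byte[] value = result.getValue(Bytes.toBytes("colfam1"), Bytes.toBytes("v"));
    System.out.println(Bytes.toString(value));
    table.close();
  }
}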

2) Also, if the data for a column is spread wide across blocks and maybe
> even across nodes how will HBase help in aggregation?
>

If a column family doesn't contain the columns your aggregation wants, then
HBase doesn't need to look at files for those column families. If you want
to run the aggregation on a subset of your key's range, then HBase doesn't
need to look at nodes that only have data outside that range.
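
In client terms, something like the following (family and row names are just
placeholders): restricting the Scan tells HBase which store files and regions
it can skip entirely.

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class NarrowScan {
  // Only colfam1's store files are read, and only the regions that
  // intersect [row-500, row-600) are touched.
  static Scan narrowScan() {
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("colfam1"));
    scan.setStartRow(Bytes.toBytes("row-500"));
    scan.setStopRow(Bytes.toBytes("row-600"));
    return scan;
  }
}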

In addition, aggregation can often be done locally at each node using
endpoint coprocessors. For example, if I want to count all the rows in my
table, a coprocessor can count all the rows on each node in parallel, and
then those counts are the only thing sent back to the node running the query.
To get the total count, I just need to sum the per-node counts.

http://ofps.oreilly.com/titles/9781449396107/clientapisadv.html
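
A hedged sketch of the row-count example using the AggregationClient that
ships with 0.92 (this assumes the AggregateImplementation coprocessor is
loaded on the table; the package and exact method signature may differ
between versions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
import org.apache.hadoop.hbase.client.coprocessor.LongColumnInterpreter;
import org.apache.hadoop.hbase.util.Bytes;

public class RowCountExample {
  public static void main(String[] args) throws Throwable {
    Configuration conf = HBaseConfiguration.create();
    AggregationClient aggregationClient = new AggregationClient(conf);
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("colfam1"));
    // Each region counts its own rows in parallel; only the per-region
    // counts come back to the client, which sums them.
    long rowCount = aggregationClient.rowCount(
        Bytes.toBytes("mytable"), new LongColumnInterpreter(), scan);
    System.out.println("rows: " + rowCount);
  }
}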


> 3) Also, about storing data using an incremental row key, initially there
> will be a hot stop with the data getting to a single region. Even after a
> split of the region into two, the first one won't be getting any data (in
> incremental row key) and the second one will be hammered.
>

Can you split your incremental row key into a hash component and a range
component? Here's a DynamoDB post explaining a use case:

http://aws.typepad.com/aws/2012/01/amazon-dynamodb-internet-scale-data-storage-the-nosql-way.html

This does mean that range scan is only efficient when it stays within a
hash prefix, though.
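
A minimal sketch of that hash-plus-range split (the 4-character hash prefix,
the separators and the names are my own choices, not a recommendation):

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.MD5Hash;

public class HashRangeKey {
  // rowkey = hash(userId) + userId + timestamp: different users land on
  // different regions, while one user's events stay in time order.
  static byte[] rowKey(String userId, long timestamp) {
    String prefix = MD5Hash.getMD5AsHex(Bytes.toBytes(userId)).substring(0, 4);
    return Bytes.toBytes(prefix + "_" + userId + "_"
        + String.format("%013d", timestamp));
  }

  // Range scans stay efficient only within one hash prefix, i.e. one user.
  static Scan scanUser(String userId, long startTs, long stopTs) {
    Scan scan = new Scan();
    scan.setStartRow(rowKey(userId, startTs));
    scan.setStopRow(rowKey(userId, stopTs));
    return scan;
  }
}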

4) Still not clear why I can't have 10 column families in HBase and why
> only 2 or 3 of them according to this link (1)?
>
> (1) - http://hbase.apache.org/book/number.of.cfs.html
>

See HBASE-3149, for starters. There are probably other JIRAs out there.

-Jason


> Praveen
>
> On Sun, Jan 22, 2012 at 12:02 PM, M. C. Srivas <mc...@gmail.com> wrote:
>
> > Praveen,
> >
> >  basically you are correct on all counts. If there are too many columns,
> >  HBase will have to issue more disk-seeks  to extract only the particular
> > columns you need ... and since the data is laid out horizontally there
> are
> > fewer common substrings in a single HBase-block and compression quality
> > starts to degrade due to reduced redundancy.
> >
> >
> > On Sat, Jan 21, 2012 at 9:49 AM, Praveen Sripati
> > <pr...@gmail.com>wrote:
> >
> > > Thanks for the response.
> > >
> > > > The contents of a row stay together like a regular row-oriented
> > database.
> > >
> > > > K: row-550/colfam1:50/1309813948188/Put/vlen=2 V: 50
> > > > K: row-550/colfam1:50/1309812287166/Put/vlen=2 V: 50
> > > > K: row-551/colfam1:51/1309813948222/Put/vlen=2 V: 51
> > > > K: row-551/colfam1:51/1309812287200/Put/vlen=2 V: 51
> > > > K: row-552/colfam1:52/1309813948256/Put/vlen=2 V: 52
> > >
> > > Is the above statement true for a HFile?
> > >
> > > Also from the above example, the data for the column family qualifier
> are
> > > not adjacent to take advantage of compression (
> > > http://en.wikipedia.org/wiki/Column-oriented_DBMS#Compression). Is
> this
> > a
> > > proper statement?
> > >
> > > Regards,
> > > Praveen
> > >
> > > On Sat, Jan 21, 2012 at 9:03 PM, <yu...@gmail.com> wrote:
> > >
> > > > Have you considered using AggregationProtocol to perform aggregation
> ?
> > > >
> > > > Thanks
> > > >
> > > >
> > > >
> > > > On Jan 20, 2012, at 11:08 PM, Praveen Sripati <
> > praveensripati@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > 1) According to the this url (1), HBase performs well for two or
> > three
> > > > > column families. Why is it so?
> > > > >
> > > > > 2) Dump of a HFile, looks like below. The contents of a row stay
> > > together
> > > > > like a regular row-oriented database. If the column family has 100
> > > column
> > > > > family qualifiers and is dense then the data for a particular
> column
> > > > family
> > > > > qualifier is spread wide. If I want to do an aggregation on a
> > > particular
> > > > > column identifier, the disk seeks doesn't seems to be much better
> > than
> > > a
> > > > > regular row-oriented database.
> > > > >
> > > > > Please correct me if I am wrong.
> > > > >
> > > > > K: row-550/colfam1:50/1309813948188/Put/vlen=2 V: 50
> > > > > K: row-550/colfam1:50/1309812287166/Put/vlen=2 V: 50
> > > > > K: row-551/colfam1:51/1309813948222/Put/vlen=2 V: 51
> > > > > K: row-551/colfam1:51/1309812287200/Put/vlen=2 V: 51
> > > > > K: row-552/colfam1:52/1309813948256/Put/vlen=2 V: 52
> > > > >
> > > > > (1) - http://hbase.apache.org/book/number.of.cfs.html
> > > > >
> > > > > Thanks,
> > > > > Praveen
> > > >
> > >
> >
>

Re: Disk Seeks and Column families

Posted by Praveen Sripati <pr...@gmail.com>.
Thanks for the response. I am just getting started with HBase. And before
getting into the code/API-level details, I am trying to understand the
problem area HBase is trying to address through its architecture/design.

1) So, what are the recommendations for having many columns with dense
data? Is HBase not the right tool?

2) Also, if the data for a column is spread wide across blocks and maybe
even across nodes how will HBase help in aggregation?

3) Also, about storing data using an incremental row key: initially there
will be a hot spot with the data going to a single region. Even after a
split of the region into two, the first one won't be getting any data (with
an incremental row key) and the second one will be hammered.

One of the approaches to alleviate this is not to insert incremental row keys
from the client and to have the row keys scattered for better load balancing.
But this approach is not efficient if I want to get events in a time
sequence, in which case I have to use some filters to scan the entire data.

4) Still not clear why I can't have 10 column families in HBase and why
only 2 or 3 of them according to this link (1)?

(1) - http://hbase.apache.org/book/number.of.cfs.html

Praveen

On Sun, Jan 22, 2012 at 12:02 PM, M. C. Srivas <mc...@gmail.com> wrote:

> Praveen,
>
>  basically you are correct on all counts. If there are too many columns,
>  HBase will have to issue more disk-seeks  to extract only the particular
> columns you need ... and since the data is laid out horizontally there are
> fewer common substrings in a single HBase-block and compression quality
> starts to degrade due to reduced redundancy.
>
>
> On Sat, Jan 21, 2012 at 9:49 AM, Praveen Sripati
> <pr...@gmail.com>wrote:
>
> > Thanks for the response.
> >
> > > The contents of a row stay together like a regular row-oriented
> database.
> >
> > > K: row-550/colfam1:50/1309813948188/Put/vlen=2 V: 50
> > > K: row-550/colfam1:50/1309812287166/Put/vlen=2 V: 50
> > > K: row-551/colfam1:51/1309813948222/Put/vlen=2 V: 51
> > > K: row-551/colfam1:51/1309812287200/Put/vlen=2 V: 51
> > > K: row-552/colfam1:52/1309813948256/Put/vlen=2 V: 52
> >
> > Is the above statement true for a HFile?
> >
> > Also from the above example, the data for the column family qualifier are
> > not adjacent to take advantage of compression (
> > http://en.wikipedia.org/wiki/Column-oriented_DBMS#Compression). Is this
> a
> > proper statement?
> >
> > Regards,
> > Praveen
> >
> > On Sat, Jan 21, 2012 at 9:03 PM, <yu...@gmail.com> wrote:
> >
> > > Have you considered using AggregationProtocol to perform aggregation ?
> > >
> > > Thanks
> > >
> > >
> > >
> > > On Jan 20, 2012, at 11:08 PM, Praveen Sripati <
> praveensripati@gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > 1) According to the this url (1), HBase performs well for two or
> three
> > > > column families. Why is it so?
> > > >
> > > > 2) Dump of a HFile, looks like below. The contents of a row stay
> > together
> > > > like a regular row-oriented database. If the column family has 100
> > column
> > > > family qualifiers and is dense then the data for a particular column
> > > family
> > > > qualifier is spread wide. If I want to do an aggregation on a
> > particular
> > > > column identifier, the disk seeks doesn't seems to be much better
> than
> > a
> > > > regular row-oriented database.
> > > >
> > > > Please correct me if I am wrong.
> > > >
> > > > K: row-550/colfam1:50/1309813948188/Put/vlen=2 V: 50
> > > > K: row-550/colfam1:50/1309812287166/Put/vlen=2 V: 50
> > > > K: row-551/colfam1:51/1309813948222/Put/vlen=2 V: 51
> > > > K: row-551/colfam1:51/1309812287200/Put/vlen=2 V: 51
> > > > K: row-552/colfam1:52/1309813948256/Put/vlen=2 V: 52
> > > >
> > > > (1) - http://hbase.apache.org/book/number.of.cfs.html
> > > >
> > > > Thanks,
> > > > Praveen
> > >
> >
>

Re: Disk Seeks and Column families

Posted by "M. C. Srivas" <mc...@gmail.com>.
Praveen,

 basically you are correct on all counts. If there are too many columns,
 HBase will have to issue more disk-seeks  to extract only the particular
columns you need ... and since the data is laid out horizontally there are
fewer common substrings in a single HBase-block and compression quality
starts to degrade due to reduced redundancy.


On Sat, Jan 21, 2012 at 9:49 AM, Praveen Sripati
<pr...@gmail.com>wrote:

> Thanks for the response.
>
> > The contents of a row stay together like a regular row-oriented database.
>
> > K: row-550/colfam1:50/1309813948188/Put/vlen=2 V: 50
> > K: row-550/colfam1:50/1309812287166/Put/vlen=2 V: 50
> > K: row-551/colfam1:51/1309813948222/Put/vlen=2 V: 51
> > K: row-551/colfam1:51/1309812287200/Put/vlen=2 V: 51
> > K: row-552/colfam1:52/1309813948256/Put/vlen=2 V: 52
>
> Is the above statement true for a HFile?
>
> Also from the above example, the data for the column family qualifier are
> not adjacent to take advantage of compression (
> http://en.wikipedia.org/wiki/Column-oriented_DBMS#Compression). Is this a
> proper statement?
>
> Regards,
> Praveen
>
> On Sat, Jan 21, 2012 at 9:03 PM, <yu...@gmail.com> wrote:
>
> > Have you considered using AggregationProtocol to perform aggregation ?
> >
> > Thanks
> >
> >
> >
> > On Jan 20, 2012, at 11:08 PM, Praveen Sripati <pr...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > 1) According to the this url (1), HBase performs well for two or three
> > > column families. Why is it so?
> > >
> > > 2) Dump of a HFile, looks like below. The contents of a row stay
> together
> > > like a regular row-oriented database. If the column family has 100
> column
> > > family qualifiers and is dense then the data for a particular column
> > family
> > > qualifier is spread wide. If I want to do an aggregation on a
> particular
> > > column identifier, the disk seeks doesn't seems to be much better than
> a
> > > regular row-oriented database.
> > >
> > > Please correct me if I am wrong.
> > >
> > > K: row-550/colfam1:50/1309813948188/Put/vlen=2 V: 50
> > > K: row-550/colfam1:50/1309812287166/Put/vlen=2 V: 50
> > > K: row-551/colfam1:51/1309813948222/Put/vlen=2 V: 51
> > > K: row-551/colfam1:51/1309812287200/Put/vlen=2 V: 51
> > > K: row-552/colfam1:52/1309813948256/Put/vlen=2 V: 52
> > >
> > > (1) - http://hbase.apache.org/book/number.of.cfs.html
> > >
> > > Thanks,
> > > Praveen
> >
>

Re: Disk Seeks and Column families

Posted by Doug Meil <do...@explorysmedical.com>.
Compression is at the block level within the StoreFile (HFile), so yes,
they can take advantage of compression.



On 1/21/12 12:49 PM, "Praveen Sripati" <pr...@gmail.com> wrote:

>Thanks for the response.
>
>> The contents of a row stay together like a regular row-oriented
>>database.
>
>> K: row-550/colfam1:50/1309813948188/Put/vlen=2 V: 50
>> K: row-550/colfam1:50/1309812287166/Put/vlen=2 V: 50
>> K: row-551/colfam1:51/1309813948222/Put/vlen=2 V: 51
>> K: row-551/colfam1:51/1309812287200/Put/vlen=2 V: 51
>> K: row-552/colfam1:52/1309813948256/Put/vlen=2 V: 52
>
>Is the above statement true for a HFile?
>
>Also from the above example, the data for the column family qualifier are
>not adjacent to take advantage of compression (
>http://en.wikipedia.org/wiki/Column-oriented_DBMS#Compression). Is this a
>proper statement?
>
>Regards,
>Praveen
>
>On Sat, Jan 21, 2012 at 9:03 PM, <yu...@gmail.com> wrote:
>
>> Have you considered using AggregationProtocol to perform aggregation ?
>>
>> Thanks
>>
>>
>>
>> On Jan 20, 2012, at 11:08 PM, Praveen Sripati <pr...@gmail.com>
>> wrote:
>>
>> > Hi,
>> >
>> > 1) According to the this url (1), HBase performs well for two or three
>> > column families. Why is it so?
>> >
>> > 2) Dump of a HFile, looks like below. The contents of a row stay
>>together
>> > like a regular row-oriented database. If the column family has 100
>>column
>> > family qualifiers and is dense then the data for a particular column
>> family
>> > qualifier is spread wide. If I want to do an aggregation on a
>>particular
>> > column identifier, the disk seeks doesn't seems to be much better
>>than a
>> > regular row-oriented database.
>> >
>> > Please correct me if I am wrong.
>> >
>> > K: row-550/colfam1:50/1309813948188/Put/vlen=2 V: 50
>> > K: row-550/colfam1:50/1309812287166/Put/vlen=2 V: 50
>> > K: row-551/colfam1:51/1309813948222/Put/vlen=2 V: 51
>> > K: row-551/colfam1:51/1309812287200/Put/vlen=2 V: 51
>> > K: row-552/colfam1:52/1309813948256/Put/vlen=2 V: 52
>> >
>> > (1) - http://hbase.apache.org/book/number.of.cfs.html
>> >
>> > Thanks,
>> > Praveen
>>



Re: Disk Seeks and Column families

Posted by Praveen Sripati <pr...@gmail.com>.
Thanks for the response.

> The contents of a row stay together like a regular row-oriented database.

> K: row-550/colfam1:50/1309813948188/Put/vlen=2 V: 50
> K: row-550/colfam1:50/1309812287166/Put/vlen=2 V: 50
> K: row-551/colfam1:51/1309813948222/Put/vlen=2 V: 51
> K: row-551/colfam1:51/1309812287200/Put/vlen=2 V: 51
> K: row-552/colfam1:52/1309813948256/Put/vlen=2 V: 52

Is the above statement true for an HFile?

Also, from the above example, the data for a given column qualifier are
not adjacent in a way that would take advantage of compression (
http://en.wikipedia.org/wiki/Column-oriented_DBMS#Compression). Is this a
proper statement?

Regards,
Praveen

On Sat, Jan 21, 2012 at 9:03 PM, <yu...@gmail.com> wrote:

> Have you considered using AggregationProtocol to perform aggregation ?
>
> Thanks
>
>
>
> On Jan 20, 2012, at 11:08 PM, Praveen Sripati <pr...@gmail.com>
> wrote:
>
> > Hi,
> >
> > 1) According to the this url (1), HBase performs well for two or three
> > column families. Why is it so?
> >
> > 2) Dump of a HFile, looks like below. The contents of a row stay together
> > like a regular row-oriented database. If the column family has 100 column
> > family qualifiers and is dense then the data for a particular column
> family
> > qualifier is spread wide. If I want to do an aggregation on a particular
> > column identifier, the disk seeks doesn't seems to be much better than a
> > regular row-oriented database.
> >
> > Please correct me if I am wrong.
> >
> > K: row-550/colfam1:50/1309813948188/Put/vlen=2 V: 50
> > K: row-550/colfam1:50/1309812287166/Put/vlen=2 V: 50
> > K: row-551/colfam1:51/1309813948222/Put/vlen=2 V: 51
> > K: row-551/colfam1:51/1309812287200/Put/vlen=2 V: 51
> > K: row-552/colfam1:52/1309813948256/Put/vlen=2 V: 52
> >
> > (1) - http://hbase.apache.org/book/number.of.cfs.html
> >
> > Thanks,
> > Praveen
>

Re: Disk Seeks and Column families

Posted by yu...@gmail.com.
Have you considered using AggregationProtocol to perform aggregation?

Thanks



On Jan 20, 2012, at 11:08 PM, Praveen Sripati <pr...@gmail.com> wrote:

> Hi,
> 
> 1) According to the this url (1), HBase performs well for two or three
> column families. Why is it so?
> 
> 2) Dump of a HFile, looks like below. The contents of a row stay together
> like a regular row-oriented database. If the column family has 100 column
> family qualifiers and is dense then the data for a particular column family
> qualifier is spread wide. If I want to do an aggregation on a particular
> column identifier, the disk seeks doesn't seems to be much better than a
> regular row-oriented database.
> 
> Please correct me if I am wrong.
> 
> K: row-550/colfam1:50/1309813948188/Put/vlen=2 V: 50
> K: row-550/colfam1:50/1309812287166/Put/vlen=2 V: 50
> K: row-551/colfam1:51/1309813948222/Put/vlen=2 V: 51
> K: row-551/colfam1:51/1309812287200/Put/vlen=2 V: 51
> K: row-552/colfam1:52/1309813948256/Put/vlen=2 V: 52
> 
> (1) - http://hbase.apache.org/book/number.of.cfs.html
> 
> Thanks,
> Praveen

Re: Disk Seeks and Column families

Posted by Andrey Stepachev <oc...@gmail.com>.
On 21 January 2012 at 19:16, Doug Meil
<do...@explorysmedical.com> wrote:
>
> One other "big picture" comment:  Hbase scales by having lots of servers,
> and servers with multiple drives. While single-read performance is
> obviously important, there is more to Hbase than a single-server RDBMS
> drag-race comparison.  It's a distributed architecture (as with MapReduce).
>
> re:  "hbase is not so good in case of wide tables, hbase prefers tall
> tables"
>
> Per... http://hbase.apache.org/book.html#schema.smackdown  this is
> absolutely true in the extreme cases as described in the book, but I
> wouldn't consider hundreds or thousands of attributes to be in that
> category as the definition of "wide" tends to be subjective.

This statement mostly relates to schemas where the column name is
a subkey, for example a timeseries for some attribute. Such a situation
does not scale well and is not handled well by HBase
(due to splits, which are performed on row boundaries).
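
For example, instead of keeping one ever-widening row per attribute with the
timestamp as the column qualifier, a tall layout puts the timestamp into the
row key, so the table can split on row boundaries (a rough sketch; the metric,
family and qualifier names are illustrative):

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class TallTimeseries {
  // Wide (problematic): row = "cpu.load", qualifier = timestamp.
  // Tall (preferred):   row = "cpu.load" + timestamp, one fixed qualifier.
  static Put tallPut(String metric, long timestamp, double value) {
    byte[] row = Bytes.toBytes(metric + "_" + String.format("%013d", timestamp));
    Put put = new Put(row);
    put.add(Bytes.toBytes("t"), Bytes.toBytes("v"), Bytes.toBytes(value));
    return put;
  }
}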

>
>
>
>
> On 1/21/12 8:52 AM, "Doug Meil" <do...@explorysmedical.com> wrote:
>
>>
>>Also, for #2 Hbase supports large-scale aggregation through MapReduce.
>>
>>
>>
>>
>>On 1/21/12 7:47 AM, "Andrey Stepachev" <oc...@gmail.com> wrote:
>>
>>>2012/1/21 Praveen Sripati <pr...@gmail.com>:
>>>> Hi,
>>>>
>>>> 1) According to the this url (1), HBase performs well for two or three
>>>> column families. Why is it so?
>>>
>>>Frist, each column family stored in separate location, so, as stated in
>>>'6.2.1. Cardinality of ColumnFamilies', such schema design can lead
>>>to many small pieces for small column family and aggregate should
>>>perform slowly.
>>>Second, if region split, all column families will split too,
>>>in case of large  number of them whis can be inefficient.
>>>Third, related to number of memstores. Each column family
>>>has it's own memstore, so it is more likely to hit forced flush
>>>and bloсked writes.
>>>
>>>>
>>>> 2) Dump of a HFile, looks like below. The contents of a row stay
>>>>together
>>>> like a regular row-oriented database. If the column family has 100
>>>>column
>>>> family qualifiers and is dense then the data for a particular column
>>>>family
>>>> qualifier is spread wide. If I want to do an aggregation on a
>>>>particular
>>>> column identifier, the disk seeks doesn't seems to be much better than
>>>>a
>>>> regular row-oriented database.
>>>
>>>You don't need seeks for each column, hbase reads blocks and filter
>>>out uneeded data.
>>>And most pefromance gained from collocated keys and compression.
>>>BTW, hbase is not so good in case of wide tables, hbase prefers tall
>>>tables.
>>>
>>>>
>>>> Please correct me if I am wrong.
>>>>
>>>> K: row-550/colfam1:50/1309813948188/Put/vlen=2 V: 50
>>>> K: row-550/colfam1:50/1309812287166/Put/vlen=2 V: 50
>>>> K: row-551/colfam1:51/1309813948222/Put/vlen=2 V: 51
>>>> K: row-551/colfam1:51/1309812287200/Put/vlen=2 V: 51
>>>> K: row-552/colfam1:52/1309813948256/Put/vlen=2 V: 52
>>>>
>>>> (1) - http://hbase.apache.org/book/number.of.cfs.html
>>>>
>>>> Thanks,
>>>> Praveen
>>>
>>>
>>>
>>>--
>>>Andrey.
>>>
>>
>>
>>
>
>



-- 
Andrey.

Re: Disk Seeks and Column families

Posted by Doug Meil <do...@explorysmedical.com>.
One other "big picture" comment:  Hbase scales by having lots of servers,
and servers with multiple drives. While single-read performance is
obviously important, there is more to Hbase than a single-server RDBMS
drag-race comparison.  It's a distributed architecture (as with MapReduce).

re:  "hbase is not so good in case of wide tables, hbase prefers tall
tables"  

Per... http://hbase.apache.org/book.html#schema.smackdown  this is
absolutely true in the extreme cases as described in the book, but I
wouldn't consider hundreds or thousands of attributes to be in that
category as the definition of "wide" tends to be subjective.




On 1/21/12 8:52 AM, "Doug Meil" <do...@explorysmedical.com> wrote:

>
>Also, for #2 Hbase supports large-scale aggregation through MapReduce.
>
>
>
>
>On 1/21/12 7:47 AM, "Andrey Stepachev" <oc...@gmail.com> wrote:
>
>>2012/1/21 Praveen Sripati <pr...@gmail.com>:
>>> Hi,
>>>
>>> 1) According to the this url (1), HBase performs well for two or three
>>> column families. Why is it so?
>>
>>Frist, each column family stored in separate location, so, as stated in
>>'6.2.1. Cardinality of ColumnFamilies', such schema design can lead
>>to many small pieces for small column family and aggregate should
>>perform slowly.
>>Second, if region split, all column families will split too,
>>in case of large  number of them whis can be inefficient.
>>Third, related to number of memstores. Each column family
>>has it's own memstore, so it is more likely to hit forced flush
>>and bloсked writes.
>>
>>>
>>> 2) Dump of a HFile, looks like below. The contents of a row stay
>>>together
>>> like a regular row-oriented database. If the column family has 100
>>>column
>>> family qualifiers and is dense then the data for a particular column
>>>family
>>> qualifier is spread wide. If I want to do an aggregation on a
>>>particular
>>> column identifier, the disk seeks doesn't seems to be much better than
>>>a
>>> regular row-oriented database.
>>
>>You don't need seeks for each column, hbase reads blocks and filter
>>out uneeded data.
>>And most pefromance gained from collocated keys and compression.
>>BTW, hbase is not so good in case of wide tables, hbase prefers tall
>>tables.
>>
>>>
>>> Please correct me if I am wrong.
>>>
>>> K: row-550/colfam1:50/1309813948188/Put/vlen=2 V: 50
>>> K: row-550/colfam1:50/1309812287166/Put/vlen=2 V: 50
>>> K: row-551/colfam1:51/1309813948222/Put/vlen=2 V: 51
>>> K: row-551/colfam1:51/1309812287200/Put/vlen=2 V: 51
>>> K: row-552/colfam1:52/1309813948256/Put/vlen=2 V: 52
>>>
>>> (1) - http://hbase.apache.org/book/number.of.cfs.html
>>>
>>> Thanks,
>>> Praveen
>>
>>
>>
>>-- 
>>Andrey.
>>
>
>
>



Re: Disk Seeks and Column families

Posted by Doug Meil <do...@explorysmedical.com>.
Also, for #2 Hbase supports large-scale aggregation through MapReduce.




On 1/21/12 7:47 AM, "Andrey Stepachev" <oc...@gmail.com> wrote:

>2012/1/21 Praveen Sripati <pr...@gmail.com>:
>> Hi,
>>
>> 1) According to the this url (1), HBase performs well for two or three
>> column families. Why is it so?
>
>Frist, each column family stored in separate location, so, as stated in
>'6.2.1. Cardinality of ColumnFamilies', such schema design can lead
>to many small pieces for small column family and aggregate should
>perform slowly.
>Second, if region split, all column families will split too,
>in case of large  number of them whis can be inefficient.
>Third, related to number of memstores. Each column family
>has it's own memstore, so it is more likely to hit forced flush
>and bloсked writes.
>
>>
>> 2) Dump of a HFile, looks like below. The contents of a row stay
>>together
>> like a regular row-oriented database. If the column family has 100
>>column
>> family qualifiers and is dense then the data for a particular column
>>family
>> qualifier is spread wide. If I want to do an aggregation on a particular
>> column identifier, the disk seeks doesn't seems to be much better than a
>> regular row-oriented database.
>
>You don't need seeks for each column, hbase reads blocks and filter
>out uneeded data.
>And most pefromance gained from collocated keys and compression.
>BTW, hbase is not so good in case of wide tables, hbase prefers tall
>tables.
>
>>
>> Please correct me if I am wrong.
>>
>> K: row-550/colfam1:50/1309813948188/Put/vlen=2 V: 50
>> K: row-550/colfam1:50/1309812287166/Put/vlen=2 V: 50
>> K: row-551/colfam1:51/1309813948222/Put/vlen=2 V: 51
>> K: row-551/colfam1:51/1309812287200/Put/vlen=2 V: 51
>> K: row-552/colfam1:52/1309813948256/Put/vlen=2 V: 52
>>
>> (1) - http://hbase.apache.org/book/number.of.cfs.html
>>
>> Thanks,
>> Praveen
>
>
>
>-- 
>Andrey.
>



Re: Disk Seeks and Column families

Posted by Andrey Stepachev <oc...@gmail.com>.
2012/1/21 Praveen Sripati <pr...@gmail.com>:
> Hi,
>
> 1) According to the this url (1), HBase performs well for two or three
> column families. Why is it so?

First, each column family is stored in a separate location, so, as stated in
'6.2.1. Cardinality of ColumnFamilies', such a schema design can lead
to many small pieces for a small column family, and aggregates would
perform slowly.
Second, if a region splits, all of its column families split too,
which can be inefficient when there is a large number of them.
Third, it is related to the number of memstores. Each column family
has its own memstore, so it is more likely to hit forced flushes
and blocked writes.

>
> 2) Dump of a HFile, looks like below. The contents of a row stay together
> like a regular row-oriented database. If the column family has 100 column
> family qualifiers and is dense then the data for a particular column family
> qualifier is spread wide. If I want to do an aggregation on a particular
> column identifier, the disk seeks doesn't seems to be much better than a
> regular row-oriented database.

You don't need seeks for each column; HBase reads blocks and filters
out unneeded data.
And most of the performance is gained from collocated keys and compression.
BTW, HBase is not so good in the case of wide tables; HBase prefers tall tables.

>
> Please correct me if I am wrong.
>
> K: row-550/colfam1:50/1309813948188/Put/vlen=2 V: 50
> K: row-550/colfam1:50/1309812287166/Put/vlen=2 V: 50
> K: row-551/colfam1:51/1309813948222/Put/vlen=2 V: 51
> K: row-551/colfam1:51/1309812287200/Put/vlen=2 V: 51
> K: row-552/colfam1:52/1309813948256/Put/vlen=2 V: 52
>
> (1) - http://hbase.apache.org/book/number.of.cfs.html
>
> Thanks,
> Praveen



-- 
Andrey.