Posted to user@hbase.apache.org by T Vinod Gupta <tv...@readypulse.com> on 2012/01/10 12:17:29 UTC

size and column count recommendations for rows in hbase

I was scanning through different questions that people have asked on this
mailing list about choosing the right schema so that map reduce jobs can be
run appropriately and hot regions from sequential access are avoided.
Somewhere I got the impression that it is OK for a row to have millions of
columns and/or a large volume of data per region, but then my map reduce
job to copy rows failed because a row was too large (121MB). So now I am
confused about what the recommended approach is. Does it mean that the
default region size and other configuration parameters need to be tweaked?

In my use case, my system receives lots of metrics for different users and
I need to maintain daily counters for each of them. It is at day
granularity and not a typical TSD series. My row key has the user id and
metric name as the prefix and the day timestamp as the suffix, and I keep
incrementing the values. The scale issue happens because I also store
information about the source of the metric, e.g. the id of the person who
mentioned my user in a tweet. I am storing all that information in
different columns of the same row, so the pattern here is variable - you
can have a million people tweet about someone and just 2 people tweet
about someone else on a given day. Is it a bad idea to use columns here?
I did it this way because it makes it easy for a different process to run
later and aggregate information, such as listing all people who mentioned
my user during a given date range.
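
Roughly, the write path described above looks like the sketch below with
the plain Java HBase client API. The table name "metrics", the column
family "d", and the literal key and qualifier strings are invented for
illustration only; the real schema may differ.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DailyMetricWriter {
  public static void main(String[] args) throws Exception {
    // Hypothetical table "metrics" with a single column family "d".
    HTable table = new HTable(HBaseConfiguration.create(), "metrics");

    // Row key as described: user id and metric name as the prefix, day
    // timestamp as the suffix.
    byte[] rowKey = Bytes.toBytes("user42:mentions:20120110");

    // Bump the daily counter atomically.
    table.incrementColumnValue(rowKey, Bytes.toBytes("d"), Bytes.toBytes("count"), 1L);

    // Record the source of the metric as one more column in the same (fat) row;
    // one qualifier per mentioning user is where the million-column rows come from.
    Put put = new Put(rowKey);
    put.add(Bytes.toBytes("d"), Bytes.toBytes("src:user9001"), Bytes.toBytes(1L));
    table.put(put);

    table.close();
  }
}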

thanks

Re: size and column count recommendations for rows in hbase

Posted by kisalay <ki...@gmail.com>.
Yes, Vinod, you got it right. I was suggesting having the secondary users
also be part of the row key, as the suffix.

On Wed, Jan 11, 2012 at 1:02 AM, T Vinod Gupta <tv...@readypulse.com> wrote:

> Thanks St.Ack and Kisalay.
> In my case, I have primary users and people who interact with my primary
> users. Let's call them secondary users.
> Kisalay, you are right, and I already have the primary user, metric name
> and timestamp in my row key. Did you mean having the secondary user also
> be part of the row key, as the suffix? If yes, I might consider that.
> St.Ack - yeah, I have all secondary users in the same CF. Even if I add
> new CFs, most of the data is the secondary users' data, so it will all
> stack up in the new CF.
>
> Thanks
>
>
> On Tue, Jan 10, 2012 at 11:20 AM, kisalay <ki...@gmail.com> wrote:
>
> > Would it make sense to convert your fat table into a tall table by
> > keeping the source of the metric as part of the row key (maybe as the
> > suffix)? For accessing all the metrics associated with a particular
> > user, metric and time, you would be resorting to a prefix match on your
> > key. Also, all the keys for a particular user, metric and time will
> > sort together and so fall in the same or adjacent regions.
> >
> >
> > On Tue, Jan 10, 2012 at 11:41 PM, Stack <st...@duboce.net> wrote:
> >
> > > On Tue, Jan 10, 2012 at 3:17 AM, T Vinod Gupta <tv...@readypulse.com>
> > > wrote:
> > > > I was scanning through different questions that people have asked
> > > > on this mailing list about choosing the right schema so that map
> > > > reduce jobs can be run appropriately and hot regions from sequential
> > > > access are avoided. Somewhere I got the impression that it is OK for
> > > > a row to have millions of columns and/or a large volume of data per
> > > > region, but then my map reduce job to copy rows failed because a row
> > > > was too large (121MB). So now I am confused about what the
> > > > recommended approach is. Does it mean that the default region size
> > > > and other configuration parameters need to be tweaked?
> > > >
> > >
> > > Yeah, if you request all of the row, it's going to try and give it to
> > > you even if it has millions of columns.  You can ask the scan to give
> > > you back a bounded number of columns per iteration so you read through
> > > the big row a piece at a time.
> > >
> > > > In my use case, my system receives lots of metrics for different
> > > > users and I need to maintain daily counters for each of them. It is
> > > > at day granularity and not a typical TSD series. My row key has the
> > > > user id and metric name as the prefix and the day timestamp as the
> > > > suffix, and I keep incrementing the values. The scale issue happens
> > > > because I also store information about the source of the metric,
> > > > e.g. the id of the person who mentioned my user in a tweet. I am
> > > > storing all that information in different columns of the same row,
> > > > so the pattern here is variable - you can have a million people
> > > > tweet about someone and just 2 people tweet about someone else on a
> > > > given day. Is it a bad idea to use columns here? I did it this way
> > > > because it makes it easy for a different process to run later and
> > > > aggregate information, such as listing all people who mentioned my
> > > > user during a given date range.
> > > >
> > >
> > > All in one column family?  Would it make sense to have more than one
> > > CF?
> > >
> > > St.Ack
> > >
> >
>

Re: size and column count recommendations for rows in hbase

Posted by T Vinod Gupta <tv...@readypulse.com>.
Thanks St.Ack and Kisalay.
In my case, I have primary users and people who interact with my primary
users. Let's call them secondary users.
Kisalay, you are right, and I already have the primary user, metric name
and timestamp in my row key. Did you mean having the secondary user also
be part of the row key, as the suffix? If yes, I might consider that.
St.Ack - yeah, I have all secondary users in the same CF. Even if I add
new CFs, most of the data is the secondary users' data, so it will all
stack up in the new CF.

Thanks


On Tue, Jan 10, 2012 at 11:20 AM, kisalay <ki...@gmail.com> wrote:

> Would it make sense to convert your fat table into a tall table by
> keeping the source of the metric as part of the row key (maybe as the
> suffix)? For accessing all the metrics associated with a particular user,
> metric and time, you would be resorting to a prefix match on your key.
> Also, all the keys for a particular user, metric and time will sort
> together and so fall in the same or adjacent regions.
>
>
> On Tue, Jan 10, 2012 at 11:41 PM, Stack <st...@duboce.net> wrote:
>
> > On Tue, Jan 10, 2012 at 3:17 AM, T Vinod Gupta <tv...@readypulse.com>
> > wrote:
> > > I was scanning through different questions that people have asked on
> > > this mailing list about choosing the right schema so that map reduce
> > > jobs can be run appropriately and hot regions from sequential access
> > > are avoided. Somewhere I got the impression that it is OK for a row
> > > to have millions of columns and/or a large volume of data per region,
> > > but then my map reduce job to copy rows failed because a row was too
> > > large (121MB). So now I am confused about what the recommended
> > > approach is. Does it mean that the default region size and other
> > > configuration parameters need to be tweaked?
> > >
> >
> > Yeah, if you request all of the row, it's going to try and give it to
> > you even if it has millions of columns.  You can ask the scan to give
> > you back a bounded number of columns per iteration so you read through
> > the big row a piece at a time.
> >
> > > In my use case, my system receives lots of metrics for different
> > > users and I need to maintain daily counters for each of them. It is
> > > at day granularity and not a typical TSD series. My row key has the
> > > user id and metric name as the prefix and the day timestamp as the
> > > suffix, and I keep incrementing the values. The scale issue happens
> > > because I also store information about the source of the metric,
> > > e.g. the id of the person who mentioned my user in a tweet. I am
> > > storing all that information in different columns of the same row,
> > > so the pattern here is variable - you can have a million people tweet
> > > about someone and just 2 people tweet about someone else on a given
> > > day. Is it a bad idea to use columns here? I did it this way because
> > > it makes it easy for a different process to run later and aggregate
> > > information, such as listing all people who mentioned my user during
> > > a given date range.
> > >
> >
> > All in one column family?  Would it make sense to have more than one CF?
> >
> > St.Ack
> >
>

Re: size and column count recommendations for rows in hbase

Posted by kisalay <ki...@gmail.com>.
Would it make sense to convert your fat table into a tall table by keeping
the source of the metric as part of the row key (maybe as the suffix)?
For accessing all the metrics associated with a particular user, metric
and time, you would be resorting to a prefix match on your key.
Also, all the keys for a particular user, metric and time will sort
together and so fall in the same or adjacent regions.
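
A rough sketch of that tall layout and the prefix read with the standard
client API follows; the table name and key format are invented for
illustration only.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class TallTableRead {
  public static void main(String[] args) throws Exception {
    // Hypothetical tall table: one narrow row per (user, metric, day, source)
    // instead of one fat row per (user, metric, day).
    HTable table = new HTable(HBaseConfiguration.create(), "metrics_tall");

    // The metric source is appended to the old key as a suffix, so a prefix
    // match on user + metric + day recovers every source for that day.
    byte[] prefix = Bytes.toBytes("user42:mentions:20120110:");

    Scan scan = new Scan(prefix);             // start the scan at the prefix
    scan.setFilter(new PrefixFilter(prefix)); // only rows whose key starts with it
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        // the key suffix of each row names one mentioning user
        System.out.println(Bytes.toString(r.getRow()));
      }
    } finally {
      scanner.close();
    }
    table.close();
  }
}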



On Tue, Jan 10, 2012 at 11:41 PM, Stack <st...@duboce.net> wrote:

> On Tue, Jan 10, 2012 at 3:17 AM, T Vinod Gupta <tv...@readypulse.com>
> wrote:
> > I was scanning through different questions that people have asked on
> > this mailing list about choosing the right schema so that map reduce
> > jobs can be run appropriately and hot regions from sequential access
> > are avoided. Somewhere I got the impression that it is OK for a row to
> > have millions of columns and/or a large volume of data per region, but
> > then my map reduce job to copy rows failed because a row was too large
> > (121MB). So now I am confused about what the recommended approach is.
> > Does it mean that the default region size and other configuration
> > parameters need to be tweaked?
> >
>
> Yeah, if you request all of the row, it's going to try and give it to
> you even if it has millions of columns.  You can ask the scan to give you
> back a bounded number of columns per iteration so you read through the
> big row a piece at a time.
>
> > In my use case, my system receives lots of metrics for different users
> > and I need to maintain daily counters for each of them. It is at day
> > granularity and not a typical TSD series. My row key has the user id
> > and metric name as the prefix and the day timestamp as the suffix, and
> > I keep incrementing the values. The scale issue happens because I also
> > store information about the source of the metric, e.g. the id of the
> > person who mentioned my user in a tweet. I am storing all that
> > information in different columns of the same row, so the pattern here
> > is variable - you can have a million people tweet about someone and
> > just 2 people tweet about someone else on a given day. Is it a bad idea
> > to use columns here? I did it this way because it makes it easy for a
> > different process to run later and aggregate information, such as
> > listing all people who mentioned my user during a given date range.
> >
>
> All in one column family?  Would it make sense to have more than one CF?
>
> St.Ack
>

Re: size and column count recommendations for rows in hbase

Posted by Stack <st...@duboce.net>.
On Tue, Jan 10, 2012 at 3:17 AM, T Vinod Gupta <tv...@readypulse.com> wrote:
> I was scanning through different questions that people have asked on this
> mailing list about choosing the right schema so that map reduce jobs can
> be run appropriately and hot regions from sequential access are avoided.
> Somewhere I got the impression that it is OK for a row to have millions
> of columns and/or a large volume of data per region, but then my map
> reduce job to copy rows failed because a row was too large (121MB). So
> now I am confused about what the recommended approach is. Does it mean
> that the default region size and other configuration parameters need to
> be tweaked?
>

Yeah, if you request all of the row, it's going to try and give it to
you even if it has millions of columns.  You can ask the scan to give you
back a bounded number of columns per iteration so you read through the
big row a piece at a time.
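
In the client API that corresponds to Scan#setBatch. A minimal sketch;
the table name and the batch/caching numbers below are arbitrary.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchedRowRead {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "metrics"); // hypothetical table name

    Scan scan = new Scan();
    scan.setBatch(1000); // at most 1000 columns per Result, so a million-column
                         // row comes back as many partial Results rather than
                         // one huge object
    scan.setCaching(10); // how many of those pieces to fetch per RPC

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result piece : scanner) {
        System.out.println(piece.size() + " cells from row " + Bytes.toString(piece.getRow()));
      }
    } finally {
      scanner.close();
    }
    table.close();
  }
}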

> In my use case, my system receives lots of metrics for different users
> and I need to maintain daily counters for each of them. It is at day
> granularity and not a typical TSD series. My row key has the user id and
> metric name as the prefix and the day timestamp as the suffix, and I keep
> incrementing the values. The scale issue happens because I also store
> information about the source of the metric, e.g. the id of the person who
> mentioned my user in a tweet. I am storing all that information in
> different columns of the same row, so the pattern here is variable - you
> can have a million people tweet about someone and just 2 people tweet
> about someone else on a given day. Is it a bad idea to use columns here?
> I did it this way because it makes it easy for a different process to run
> later and aggregate information, such as listing all people who mentioned
> my user during a given date range.
>

All in one column family?  Would it make sense to have more than one CF?
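
For reference, declaring an extra family is a table-schema change; a rough
sketch with HBaseAdmin, where the table and family names are invented for
illustration only.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateMetricsTable {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());

    // Hypothetical split: small daily counters in one family, the bulky
    // per-source columns in another, so the two access patterns are stored
    // and compacted separately.
    HTableDescriptor desc = new HTableDescriptor("metrics_v2");
    desc.addFamily(new HColumnDescriptor("counters"));
    desc.addFamily(new HColumnDescriptor("sources"));
    admin.createTable(desc);
  }
}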

St.Ack