Posted to user@hbase.apache.org by AnandaVelMurugan Chandra Mohan <an...@gmail.com> on 2012/07/16 07:30:11 UTC

Rowkey hashing to avoid hotspotting

Hi,

I am using HBase to store data about mechanical components. Each component
has a model no., a serial no., and some other attributes.

I will be querying my data mostly by model no. and serial no., so I created
a composite key from these two attributes and added a timestamp to make it
unique.

To filter the data, I use a row filter with a regex string comparator, and it
works well with sample seed data. Now I am worried that this setup will
lead to region server hotspotting when we load production data into HBase. I
read that hashing may solve this problem. Can someone help me implement
hashing of the row key? I also want the row filter to keep working, as I have
to display the number of components on a web page and I use the row filter to
implement that functionality. Any guidance would be of great help.

-- 
Regards,
Anand

Re: Rowkey hashing to avoid hotspotting

Posted by AnandaVelMurugan Chandra Mohan <an...@gmail.com>.
Thanks a lot, guys! I will evaluate and implement a solution based on your
suggestions.

On Thu, Jul 19, 2012 at 10:22 PM, syed kather <in...@gmail.com> wrote:

> Anand,
>      I had a case in which I combined 4 fields into one row key. The
> serial number can be the first part of the rowkey and the model number the
> second part, so that the binary search (B-Search) on the row key will be
> faster, because it reduces the number of jumps during the search.
> Note: if the serial number changes frequently, then put the serial number
> in the first part.
>
>   To solve the hotspotting problem, I have started implementing
>
>
> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>
> In my case I had 20 million rows in my HBase table, and I had the same
> problem while reading in MapReduce.
>
>             Thanks and Regards,
>         S SYED ABDUL KATHER
>
>
>
> On Thu, Jul 19, 2012 at 8:52 PM, Alex Baranau <alex.baranov.v@gmail.com> wrote:
>
> > > I read somewhere that HBase is not
> > > good at handling more than 100 column families
> >
> > Heh. Usually it is not good to have more than two or three, actually.
> > See [1], and maybe also [2].
> >
> > Alex Baranau
> > ------
> > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
> >
> > [1] http://hbase.apache.org/book/number.of.cfs.html
> > [2] http://blog.sematext.com/2012/07/16/hbase-memstore-what-you-should-know
> >
> > On Thu, Jul 19, 2012 at 11:08 AM, AnandaVelMurugan Chandra Mohan <
> > ananthu2050@gmail.com> wrote:
> >
> > > Hi Cristofer,
> > >
> > > No problem... I am happy to share and learn.. :)
> > >
> > > Regarding a timestamp-based column family, I haven't thought about it.
> > > But my only concern is the number of column families. I read somewhere
> > > that HBase is not good at handling more than 100 column families.
> > >
> > >
> > > On Wed, Jul 18, 2012 at 11:15 PM, Cristofer Weber <
> > > cristofer.weber@neogrid.com> wrote:
> > >
> > > > Hi Anand!
> > > >
> > > > I see... sorry for being so curious, but since I started studying
> > > > HBase I have been curious about how people model their tables, and in
> > > > what kinds of systems HBase is in use.
> > > >
> > > > Have you evaluated recording your reports in a distinct CF using
> > > > timestamps as column qualifiers? It's my curiosity asking again!
> > > >
> > > > Thanks for sharing!
> > > >
> > > > Regards,
> > > > Cristofer
> > > >
> > > > -----Original Message-----
> > > > From: AnandaVelMurugan Chandra Mohan [mailto:ananthu2050@gmail.com]
> > > > Sent: Wednesday, July 18, 2012 13:04
> > > > To: user@hbase.apache.org
> > > > Subject: Re: Rowkey hashing to avoid hotspotting
> > > >
> > > > Hi Cristofer,
> > > >
> > > > The data I store is test cell reports about a component. I have many
> > > > test cell reports for each model number + serial number combination,
> > > > so to make the rowkey unique, I added a timestamp.
> > > >
> > > >
> > > > On Wed, Jul 18, 2012 at 3:14 AM, Cristofer Weber <
> > > > cristofer.weber@neogrid.com> wrote:
> > > >
> > > > > So, Anand, there are some things that can help, but again, most of
> > > > > them are related to the famous access patterns.
> > > > >
> > > > > Sometimes it is not easy to get more information about them in
> > > > > advance, but if you are replacing another system you can study its
> > > > > data distribution: grouping for counts, means, changes over time,
> > > > > etc. It is possible to analyze partial data too, but it is risky
> > > > > because you are subject to the way this partial data was gathered;
> > > > > sample data may not be representative.
> > > > >
> > > > > Salting your rowkey with a hash calculated over your model# will
> > > > > probably result in a uniform distribution over a range (if using a
> > > > > modulus), and pre-splitting your table will balance your load over
> > > > > your Region Servers.
> > > > > Also, you will be able to recalculate the hash for your model#
> > > > > before scanning for it, allowing a scan over specific rowkeys while
> > > > > restricting the scan with startRow and stopRow. Remember that if
> > > > > your rowkeys share the same prefix they will probably be located in
> > > > > the same region, and your scan will benefit from this.
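
A minimal sketch of the idea in the paragraph above, assuming a hypothetical key layout of salt|model|serial|timestamp: because the salt is computed from the model number alone, the same salt can be recomputed at read time and used to build startRow and stopRow. The class name, the `|` separator, the bucket count, and the use of `String.hashCode()` are all assumptions for illustration, not something specified in this thread.

```java
final class ModelSalt {
    // Hypothetical bucket count; at most 256 so the salt fits in two hex digits.
    static final int BUCKETS = 16;

    /** Salt derived from the model number only, so it can be recomputed at read time. */
    static String saltFor(String model) {
        // floorMod keeps the bucket non-negative even when hashCode() is negative.
        return String.format("%02x", Math.floorMod(model.hashCode(), BUCKETS));
    }

    /** Row key layout (an assumption): salt|model|serial|zero-padded timestamp. */
    static String rowKey(String model, String serial, long timestampMillis) {
        return saltFor(model) + "|" + model + "|" + serial + "|"
                + String.format("%013d", timestampMillis);
    }

    /** Inclusive lower bound for all rows of the given model. */
    static String startRow(String model) {
        return saltFor(model) + "|" + model + "|";
    }

    /** Exclusive upper bound: '}' is the character right after '|' in ASCII. */
    static String stopRow(String model) {
        return saltFor(model) + "|" + model + "}";
    }

    public static void main(String[] args) {
        System.out.println(rowKey("M100", "SN0001", 1342423811000L));
        System.out.println(startRow("M100") + " .. " + stopRow("M100"));
    }
}
```

This sketch assumes model and serial numbers never contain the `|` character. Note the trade-off: all rows of one model share a salt, so a very popular model still lands in one bucket.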
> > > > >
> > > > > I'm still curious about your need to add a timestamp after your
> > > > > model#,serial#... I have some background in manufacturing systems,
> > > > > and usually a serial number is unique. But, of course, it's just
> > > > > curiosity.  :-)
> > > > >
> > > > > Regards,
> > > > > Cristofer
> > > > >
> > > > > -----Original Message-----
> > > > > From: Alex Baranau [mailto:alex.baranov.v@gmail.com]
> > > > > Sent: Tuesday, July 17, 2012 12:53
> > > > > To: user@hbase.apache.org
> > > > > Subject: Re: Rowkey hashing to avoid hotspotting
> > > > >
> > > > > The most common reason for RS hotspotting when writing data to
> > > > > HBase is writing rows with monotonically increasing/decreasing row
> > > > > keys. E.g. if you put a timestamp in the first part of your key,
> > > > > then you are likely to have monotonically increasing row keys. You
> > > > > can find more info about this issue and how to solve it here: [1],
> > > > > and you may also want to look at an already implemented salting
> > > > > solution [2].
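
One consequence of spreading sequential keys over buckets is that a range read has to visit every bucket. The following is a rough sketch of that fan-out, not the API of the library linked as [2]: it assumes keys were written as a two-hex-digit bucket, a `|` separator, and then the original key, and it only computes the per-bucket scan boundaries. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class BucketScans {
    /**
     * Expands one logical key range into one [start, stop) pair per bucket.
     * Assumes keys were written as two-hex-digit bucket + '|' + original key.
     */
    static List<String[]> rangesFor(String start, String stop, int buckets) {
        if (buckets < 1 || buckets > 256) {
            throw new IllegalArgumentException("buckets must be in 1..256");
        }
        List<String[]> ranges = new ArrayList<>();
        for (int b = 0; b < buckets; b++) {
            String prefix = String.format("%02x|", b);
            ranges.add(new String[] { prefix + start, prefix + stop });
        }
        return ranges;
    }

    public static void main(String[] args) {
        // One scan per bucket; the results would then be merged by the client.
        for (String[] r : rangesFor("20120716", "20120717", 4)) {
            System.out.println(r[0] + " .. " + r[1]);
        }
    }
}
```

More buckets spread writes more evenly but multiply the number of scans per read, so the bucket count is a trade-off between write and read cost.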
> > > > >
> > > > > As for RS hotspotting during reading - it is hard to predict without
> > > > > knowing what the most common data access patterns are. E.g. putting
> > > > > model # in the first part of a key may seem like a good
> > > > > distribution, but if your web site is used mostly by Mercedes
> > > > > owners, the majority of the read load may be directed to just a few
> > > > > regions. Again, salting can help a lot here.
> > > > >
> > > > > +1 to what Cristofer said on other things, esp.: use partial key
> > > > > scans where possible instead of filters, and pre-split your table.
> > > > >
> > > > > Alex Baranau
> > > > > ------
> > > > > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
> > > > >
> > > > > [1] http://bit.ly/HnKjbc
> > > > > [2] https://github.com/sematext/HBaseWD
> > > > >
> > > > > On Tue, Jul 17, 2012 at 10:44 AM, AnandaVelMurugan Chandra Mohan <
> > > > > ananthu2050@gmail.com> wrote:
> > > > >
> > > > > > Hi Cristofer,
> > > > > >
> > > > > > Thanks for the elaborate response!
> > > > > >
> > > > > > I don't have much information about production data, as I work
> > > > > > with partial data. But based on discussions with my project
> > > > > > partners, I have some answers for you.
> > > > > >
> > > > > > The number of model numbers and serial numbers will be finite. Not
> > > > > > so many...
> > > > > > As far as I know, there is no predefined rule for model number or
> > > > > > serial number creation.
> > > > > >
> > > > > > I have two access patterns. First, I count the number of rows for
> > > > > > a specific model number; I use a row filter for this. Second, I
> > > > > > filter the rows based on model, serial number and some other
> > > > > > columns; I scan the table with a column value filter for this case.
> > > > > >
> > > > > > I will evaluate salting as you have explained.
> > > > > >
> > > > > > Regards,
> > > > > > Anand.C
> > > > > >
> > > > > > On Tue, Jul 17, 2012 at 12:30 AM, Cristofer Weber <
> > > > > > cristofer.weber@neogrid.com> wrote:
> > > > > >
> > > > > > > Hi Anand,
> > > > > > >
> > > > > > > As usual, the answer is that 'it depends'  :)
> > > > > > >
> > > > > > > I think that the main question here is: why are you afraid that
> > > > > > > this setup would lead to region server hotspotting? Is it
> > > > > > > because you don't know what your production data will look like?
> > > > > > >
> > > > > > > Based on what you said about your rowkey, you will query mostly
> > > > > > > by providing model no. + serial no., but:
> > > > > > > 1 - How is your rowkey distributed? Are there tons of different
> > > > > > > modelNumbers AND serialNumbers? Few modelNumbers and a lot of
> > > > > > > serialNumbers? Few of both?
> > > > > > > 2 - Putting modelNumber at the front of your rowkey means that
> > > > > > > your data will be sorted by rowkey. So, what is the rule that
> > > > > > > determines how a modelNumber is created? Is it a sequential
> > > > > > > number that increases over time? If so, are newer members
> > > > > > > accessed a lot more than older members? If not, what drives this
> > > > > > > number? Is it an encoding rule?
> > > > > > > 3 - Do you expect more write/read load on a few of these
> > > > > > > modelNumbers and/or serialNumbers? Will it be similar to a
> > > > > > > Pareto distribution? Distributed over what?
> > > > > > >
> > > > > > > Also, two other things got my attention here...
> > > > > > > 1 - Why are you filtering with regex? If your queries are over
> > > > > > > model no. + serial no., why don't you just scan starting at your
> > > > > > > modelNumber+SerialNumber, and stopping at your next
> > > > > > > modelNumber+SerialNumber? Or is there another access pattern
> > > > > > > that doesn't apply to your composite rowkey?
> > > > > > > 2 - Why do you have to add a timestamp to ensure uniqueness?
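
The range scan suggested in point 1 above needs a start row and an exclusive stop row. A common way to get the stop row for "everything with this prefix" is to increment the last byte of the prefix; the sketch below shows that computation on plain byte arrays. The class name, the `|` separator, and the sample values are hypothetical, and the empty-array result relies on the HBase convention that an empty stop row means "scan to the end of the table".

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class PrefixRange {
    /** Smallest key that sorts after every key starting with the given prefix. */
    static byte[] stopRowFor(byte[] prefix) {
        for (int i = prefix.length - 1; i >= 0; i--) {
            if (prefix[i] != (byte) 0xff) {
                // Drop any trailing 0xff bytes and bump the last remaining byte.
                byte[] stop = Arrays.copyOf(prefix, i + 1);
                stop[i]++;
                return stop;
            }
        }
        return new byte[0]; // prefix is empty or all 0xff: no upper bound exists
    }

    /** Unsigned lexicographic comparison, the order in which row keys are sorted. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] start = "M100|SN0001|".getBytes(StandardCharsets.UTF_8);
        byte[] stop = stopRowFor(start);
        System.out.println(new String(stop, StandardCharsets.UTF_8));
    }
}
```

With a start row of model + serial + separator and this stop row, the scan touches only the rows for that component, whereas a regex row filter has to examine every row in the scanned range.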
> > > > > > >
> > > > > > > Now, answering your question without more info about your data,
> > > > > > > you can apply a hash in two ways:
> > > > > > > 1 - Generating a hash (MD5 is the most common, as far as I have
> > > > > > > read) and using only this hash as your rowkey. Based on what you
> > > > > > > have said, this way doesn't fit your needs, because you would
> > > > > > > not be able to apply your filter anymore.
> > > > > > > 2 - Salting, by prefixing your current rowkey with a pinch of
> > > > > > > hash. Notice that the hash portion must be your rowkey prefix to
> > > > > > > ensure a kind of balanced distribution over something (where
> > > > > > > something is your region servers). I'm working with a case that
> > > > > > > is a bit similar to yours, and what I'm doing right now is
> > > > > > > calculating the hashValue of my rowkey and using a Java
> > > > > > > Formatter to create a hex string to prepend to my rowkey.
> > > > > > > Something like a String.format("%03x", hashValue)
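
As a rough illustration of the salting described in option 2 above (a sketch, not the code used by anyone in this thread): the key is hashed with MD5, the hash is reduced to a bucket number, and the bucket is rendered with `String.format("%03x", ...)` and prepended. The class name, the `|` separator, the bucket count of 4096, and the sample key are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class SaltedRowKey {
    // Hypothetical bucket count; 4096 buckets fit exactly in three hex digits.
    static final int BUCKETS = 4096;

    /** Maps a key to a stable bucket in [0, BUCKETS) using two bytes of its MD5 digest. */
    static int bucketOf(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            int value = ((digest[0] & 0xff) << 8) | (digest[1] & 0xff);
            return value % BUCKETS;
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the digests every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    /** Prepends the three-hex-digit salt to the original key. */
    static String salt(String originalKey) {
        return String.format("%03x", bucketOf(originalKey)) + "|" + originalKey;
    }

    public static void main(String[] args) {
        System.out.println(salt("M100|SN0001|20120716073011"));
    }
}
```

Hashing the whole key spreads the rows of one model across buckets, so a per-model count would need one scan per bucket. Deriving the salt from the model number alone avoids that, at the cost of keeping each model in a single bucket.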
> > > > > > >
> > > > > > > In both cases, you still have to split your regions in advance,
> > > > > > > and it is better to work out your splits before starting to
> > > > > > > feed your table with production data.
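
When the row key starts with a fixed-width hex salt such as the `"%03x"` prefix mentioned above, evenly spaced salt values make natural split points. The sketch below only computes those boundaries as strings; the class name and the bucket and region counts are hypothetical, and passing the resulting keys (as bytes) to the table-creation call is left out.

```java
import java.util.ArrayList;
import java.util.List;

final class SaltSplits {
    /**
     * Returns regions - 1 split points that divide the salt space [0, buckets)
     * into equally sized ranges, formatted like a three-hex-digit salt.
     */
    static List<String> splitPoints(int buckets, int regions) {
        if (buckets < 1 || buckets > 4096 || regions < 1 || regions > buckets) {
            throw new IllegalArgumentException("need 1 <= regions <= buckets <= 4096");
        }
        List<String> splits = new ArrayList<>();
        for (int r = 1; r < regions; r++) {
            int boundary = (int) ((long) buckets * r / regions);
            splits.add(String.format("%03x", boundary));
        }
        return splits;
    }

    public static void main(String[] args) {
        // 4096 salt values over 16 regions: boundaries every 256 salts.
        System.out.println(splitPoints(4096, 16));
    }
}
```

Each region then owns a contiguous block of salt values, so uniformly distributed salts translate into roughly equal load per region from the first write.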
> > > > > > >
> > > > > > > Also, you have to study the consequences that changing your
> > > > > > > rowkey will bring. It's not free.
> > > > > > >
> > > > > > > There are a lot of words here and a lot of questions, so by now
> > > > > > > I feel I have started shooting in the dark. Try to understand
> > > > > > > your production data, and if you have more to share, it will
> > > > > > > surely help!
> > > > > > >
> > > > > > > Regards,
> > > > > > > Cristofer
> > > > > > >



-- 
Regards,
Anand

Re: Rowkey hashing to avoid hotspotting

Posted by syed kather <in...@gmail.com>.
Anand,
     I had a case in which I combined 4 fields into one row key. The
serial number can be the first part of the rowkey and the model number the
second part, so that the binary search (B-Search) on the row key will be
faster, because it reduces the number of jumps during the search.
Note: if the serial number changes frequently, then put the serial number
in the first part.

  To solve the hotspotting problem, I have started implementing

http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/

In my case I had 20 million rows in my HBase table, and I had the same
problem while reading in MapReduce.

            Thanks and Regards,
        S SYED ABDUL KATHER




Re: Rowkey hashing to avoid hotspotting

Posted by Alex Baranau <al...@gmail.com>.
> I read somewhere that HBase is not
> good at handling more than 100 column families

Heh. Usually it is not good to have more than two or three, actually.
See [1], and maybe also [2].

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
Solr

[1] http://hbase.apache.org/book/number.of.cfs.html
[2] http://blog.sematext.com/2012/07/16/hbase-memstore-what-you-should-know

On Thu, Jul 19, 2012 at 11:08 AM, AnandaVelMurugan Chandra Mohan <
ananthu2050@gmail.com> wrote:

> Hi Cristofer,
>
> No problem... I am happy to share and learn.. :)
>
> Regarding timestamp based column family, I haven't thought about it. But my
> only concern is no of column families. I read somewhere that HBase is not
> good at handling more than 100 column families.
>
>
> On Wed, Jul 18, 2012 at 11:15 PM, Cristofer Weber <
> cristofer.weber@neogrid.com> wrote:
>
> > Hi Anand!
> >
> > I see... sorry for being so curious, but since I started studying HBase I
> > am curious about how people are modeling their tables, and in what kinds
> of
> > systems HBase is in use.
> >
> > Have you evaluated recording your reports in a distinct CF using
> > timestamps as column qualifiers? It's my curiosity asking again!
> >
> > Thanks for sharing!
> >
> > Regards,
> > Cristofer
> >
> > -----Mensagem original-----
> > De: AnandaVelMurugan Chandra Mohan [mailto:ananthu2050@gmail.com]
> > Enviada em: quarta-feira, 18 de julho de 2012 13:04
> > Para: user@hbase.apache.org
> > Assunto: Re: Rowkey hashing to avoid hotspotting
> >
> > Hi Cristofer,
> >
> > Data i store is test cell reports about a component. I have many test
> cell
> > reports for each model number + serial number combination. So to make
> > rowkey unique, I added timstamp.
> >
> >
> > On Wed, Jul 18, 2012 at 3:14 AM, Cristofer Weber <
> > cristofer.weber@neogrid.com> wrote:
> >
> > > So, Anand, there are some things that can help, but again, most of
> > > them are related with the famous access patterns.
> > >
> > > Sometimes is not easy to get more information about them in advance,
> > > but if you are replacing another system you can study its data
> > > distribution, grouping for counts, mean, changes over time, etc. It is
> > > possible to analyze with partial data too, but it is risky because you
> > > will be subjected to the way this partial data was gathered; sample
> > > data may not be representative.
> > >
> > > Salting your rowkey with a hash calculated over your model# will
> > > probably result in an uniform distribution over a range (if using
> > > modulus), and pre-spliting your table will balance your load over your
> > Region Servers.
> > > Also, you will be able to recalculate your hash for your model# before
> > > scanning for it, allowing for a scan over specific rowkey while
> > > restricting this scan by startRow and stopRow. Remember that if your
> > > rowkeys shares the same prefix they will probably be located in the
> > > same region and your scan will be favored by this.
> > >
> > > I'm still curious about your need of adding a timestamp after your
> > > model#,serial#... I have some background in manufacturing systems and
> > > usually a serial number is unique. But, of course, it's just
> > > curiosity.  :-)
> > >
> > > Regards,
> > > Cristofer
> > >



-- 
Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
Solr

Re: Rowkey hashing to avoid hotspotting

Posted by AnandaVelMurugan Chandra Mohan <an...@gmail.com>.
Hi Cristofer,

No problem... I am happy to share and learn.. :)

Regarding a timestamp-based column family, I haven't thought about it. My
only concern is the number of column families. I read somewhere that HBase
is not good at handling more than 100 column families.
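
One reading of the suggestion quoted below (a distinct CF using timestamps as column qualifiers) is that a single extra column family is enough, because each report becomes a new qualifier in the same row and not a new family. A minimal sketch of that layout, modeled with plain Java maps and not with the HBase client API; the class name, the "|" separator and the method names are all hypothetical:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory model of the row layout; this is NOT the HBase client API.
final class ReportTableModel {
    // One "reports" family: rowKey -> (qualifier = report timestamp in ms -> report payload).
    private final NavigableMap<String, NavigableMap<Long, String>> reportsFamily = new TreeMap<>();

    // One row per component; every report is stored under a new qualifier in that row.
    void putReport(String model, String serial, long timestampMs, String report) {
        String rowKey = model + "|" + serial; // "|" separator is an assumption
        reportsFamily.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(timestampMs, report);
    }

    // Number of distinct components (rows), independent of how many reports each one has.
    int componentCount() {
        return reportsFamily.size();
    }

    // Number of reports (qualifiers) stored for one component.
    int reportCount(String model, String serial) {
        NavigableMap<Long, String> row = reportsFamily.get(model + "|" + serial);
        return row == null ? 0 : row.size();
    }
}
```

With this layout the number of families stays at one however many reports arrive, and the rowkey no longer needs the timestamp to be unique. Whether very wide rows suit the real data is a separate question that the thread does not answer.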


On Wed, Jul 18, 2012 at 11:15 PM, Cristofer Weber <
cristofer.weber@neogrid.com> wrote:

> Hi Anand!
>
> I see... sorry for being so curious, but since I started studying HBase I
> have been curious about how people are modeling their tables, and in what
> kinds of systems HBase is in use.
>
> Have you evaluated recording your reports in a distinct CF using
> timestamps as column qualifiers? It's my curiosity asking again!
>
> Thanks for sharing!
>
> Regards,
> Cristofer
>



-- 
Regards,
Anand

RES: Rowkey hashing to avoid hotspotting

Posted by Cristofer Weber <cr...@neogrid.com>.
Hi Anand!

I see... sorry for being so curious, but since I started studying HBase I have been curious about how people are modeling their tables, and in what kinds of systems HBase is in use.

Have you evaluated recording your reports in a distinct CF using timestamps as column qualifiers? It's my curiosity asking again!

Thanks for sharing!

Regards,
Cristofer

-----Original Message-----
From: AnandaVelMurugan Chandra Mohan [mailto:ananthu2050@gmail.com]
Sent: Wednesday, July 18, 2012 13:04
To: user@hbase.apache.org
Subject: Re: Rowkey hashing to avoid hotspotting

Hi Cristofer,

The data I store are test cell reports about a component. I have many test cell reports for each model number + serial number combination. So, to make the rowkey unique, I added a timestamp.





--
Regards,
Anand

Re: Rowkey hashing to avoid hotspotting

Posted by AnandaVelMurugan Chandra Mohan <an...@gmail.com>.
Hi Cristofer,

The data I store are test cell reports about a component. I have many test
cell reports for each model number + serial number combination. So, to make
the rowkey unique, I added a timestamp.
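
A composite key of this kind also lends itself to the partial key scans recommended elsewhere in the thread, since HBase orders rows by the unsigned bytes of the key. The following is only a sketch under assumptions: the "|" separator, the zero-padded millisecond timestamp and every name in it are hypothetical, and the start/stop byte arrays are the values one would hand to a scan as its start row (inclusive) and stop row (exclusive).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the key layout "model|serial|timestamp" is an assumption.
final class ComponentRowKey {
    // Assumed separator; it must never occur inside a model or serial number.
    private static final char SEP = '|';

    // Builds model|serial|timestamp. The timestamp is zero-padded to 13 digits
    // so that reports for one component sort chronologically as bytes.
    static byte[] rowKey(String model, String serial, long timestampMs) {
        String key = model + SEP + serial + SEP + String.format("%013d", timestampMs);
        return key.getBytes(StandardCharsets.UTF_8);
    }

    // Inclusive start row covering every key of the given model.
    static byte[] startRow(String model) {
        return (model + SEP).getBytes(StandardCharsets.UTF_8);
    }

    // Exclusive stop row: the same prefix with its last byte incremented
    // ('|' is 0x7C, so it becomes 0x7D and cannot overflow).
    static byte[] stopRow(String model) {
        byte[] stop = startRow(model);
        stop[stop.length - 1]++;
        return stop;
    }

    // Unsigned lexicographic byte comparison, the ordering used for row keys.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }
}
```

Counting the rows of one model then becomes a bounded scan between startRow(model) and stopRow(model) instead of a regex filter over the whole table. The trailing separator in the prefix keeps a model such as "M1000" out of a scan for "M100".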


On Wed, Jul 18, 2012 at 3:14 AM, Cristofer Weber <
cristofer.weber@neogrid.com> wrote:

> So, Anand, there are some things that can help, but again, most of them
> are related to the famous access patterns.
>
> Sometimes it is not easy to get more information about them in advance, but
> if you are replacing another system you can study its data distribution,
> grouping for counts, mean, changes over time, etc. It is possible to
> analyze with partial data too, but it is risky because you will be
> subject to the way this partial data was gathered; sample data may not be
> representative.
>
> Salting your rowkey with a hash calculated over your model# will probably
> result in a uniform distribution over a range (if using modulus), and
> pre-splitting your table will balance your load over your Region Servers.
> Also, you will be able to recalculate your hash for your model# before
> scanning for it, allowing for a scan over a specific rowkey while
> restricting this scan by startRow and stopRow. Remember that if your
> rowkeys share the same prefix they will probably be located in the same
> region and your scan will benefit from this.
>
> I'm still curious about why you need to add a timestamp after your
> model#,serial#... I have some background in manufacturing systems and
> usually a serial number is unique. But, of course, it's just curiosity.  :-)
>
> Regards,
> Cristofer
>
> -----Original Message-----
> From: Alex Baranau [mailto:alex.baranov.v@gmail.com]
> Sent: Tuesday, July 17, 2012 12:53
> To: user@hbase.apache.org
> Subject: Re: Rowkey hashing to avoid hotspotting
>
> The most common reason for RS hotspotting while writing data to HBase is
> writing rows with monotonically increasing/decreasing row keys. E.g. if you
> put a timestamp in the first part of your key, then you are likely to have
> monotonically increasing row keys. You can find more info about this issue
> and how to solve it here: [1], and you may also want to look at an already
> implemented salting solution [2].
>
> As for RS hotspotting during reading - it is hard to predict without
> knowing what the most common data access patterns are. E.g. putting model #
> in the first part of a key may seem like a good distribution, but if your
> web site is used mostly by Mercedes owners, the majority of the read load
> may be directed to just a few regions. Again, salting can help a lot here.
>
> +1 to what Cristofer said on other things, esp: use partial key scans
> where possible instead of filters, and pre-split your table.
>
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> Solr
>
> [1] http://bit.ly/HnKjbc
> [2] https://github.com/sematext/HBaseWD
>
> On Tue, Jul 17, 2012 at 10:44 AM, AnandaVelMurugan Chandra Mohan <
> ananthu2050@gmail.com> wrote:
>
> > Hi Cristofer,
> >
> > Thanks for the elaborate response!!!
> >
> > I don't have much information about the production data, as I work with
> > partial data. But based on discussion with my project partners, I have
> > some answers for you.
> >
> > The number of model numbers and serial numbers will be finite. Not so
> > many... As far as I know, there is no predefined rule for model number
> > or serial number creation.
> >
> > I have two access patterns. I count the number of rows for a specific
> > model number. I use a rowkey filter for this. Also, I filter the rows
> > based on model, serial number and some other columns. I scan the table
> > with a column value filter for this case.
> >
> > I will evaluate salting as you have explained.
> >
> > Regards,
> > Anand.C
> >
> > On Tue, Jul 17, 2012 at 12:30 AM, Cristofer Weber <
> > cristofer.weber@neogrid.com> wrote:
> >
> > > Hi Anand,
> > >
> > > As usual, the answer is that 'it depends'  :)
> > >
> > > I think that the main question here is: why are you afraid that this
> > > setup would lead to region server hotspotting? Is it because you don't
> > > know what your production data will look like?
> > >
> > > Based on what you told us about your rowkey, you will query mostly by
> > > providing model no. + serial no., but:
> > > 1 - How are your rowkeys distributed? Are there tons of different
> > > modelNumbers AND serialNumbers? Few modelNumbers and a lot of
> > > serialNumbers? Few of both?
> > > 2 - Putting modelNumber at the front of your rowkey means that your
> > > data will be sorted by rowkey. So, what is the rule that determines
> > > how a modelNumber is created? Is it a sequential number that increases
> > > over time? If so, are newer members accessed a lot more than older
> > > members? If not, what will drive this number? Is it an encoding rule?
> > > 3 - Do you expect more write/read load over a few of these
> > > modelNumbers and/or serialNumbers? Will it be similar to a Pareto
> > > distribution? Distributed over what?
> > >
> > > Also, two other things got my attention here...
> > > 1 - Why are you filtering with regex? If your queries are over model
> > > no. + serial no., why don't you just scan starting at your
> > > modelNumber+SerialNumber, and stopping at your next SerialNumber? Or
> > > is there another access pattern that doesn't apply to your composite
> > > rowkey?
> > > 2 - Why do you have to add a timestamp to ensure uniqueness?
> > >
> > > Now, answering your question without more info about your data, you
> > > can apply a hash in two ways:
> > > 1 - Generating a hash (MD5 is the most common as far as I have read)
> > > and using only this hash as your rowkey. Based on what you have told
> > > us, this way doesn't fit your needs, because you would not be able to
> > > apply your filter anymore.
> > > 2 - Salting, by prefixing your current rowkey with a pinch of hash.
> > > Notice that the hash portion must be your rowkey prefix to ensure a
> > > kind of balanced distribution over something (where something is your
> > > region servers). I'm working with a case that is a bit similar to
> > > yours, and what I'm doing right now is calculating the hashValue of my
> > > rowkey and using a Java Formatter to create a hex string to prepend to
> > > my rowkey. Something like String.format("%03x", hashValue)
> > >
> > > In both cases, you still have to split your regions in advance, and
> > > it will be better to work your splitting before starting to feed
> > > your table with production data.
> > >
> > > Also, you have to study the consequences that changing your rowkey
> > > will bring. It's not for free.
> > >
> > > There's a lot of words here and a lot of questions, so by now I feel
> > > I started to shoot in the dark. Try to understand your production
> > > data and
> > if
> > > you have more to share, for sure it will help!
> > >
> > > Regards,
> > > Cristofer
> > >
> > > -----Original Message-----
> > > From: AnandaVelMurugan Chandra Mohan [mailto:ananthu2050@gmail.com]
> > > Sent: Monday, July 16, 2012 02:30
> > > To: user@hbase.apache.org
> > > Subject: Rowkey hashing to avoid hotspotting
> > >
> > > Hi,
> > >
> > > I am using Hbase to store data about mechanical components. Each
> > component
> > > has model no. and serial no. and some other attributes.
> > >
> > > I would be querying my data mostly by model no. and serial no. So I
> > > created a composite key with these two attributes and added
> > > timestamp to make it unique.
> > >
> > > To filter the data, I use rowkey filter with regex string comparator
> > > and it works well with sample seed data. Now I am afraid whether
> > > this set up will lead to region server hotspotting when we load
> > > production data in HBase. I read hashing may solve this problem. Can
> > > some one help me in implementing hashing the row key? Also I would
> > > want the row filter to
> > work
> > > as I have to display the number of components in a web page and I
> > > use row key filter for implementing that functionality? Any guidance
> > > would be of great help.
> > >
> > > --
> > > Regards,
> > > Anand
> > >
> >
> >
> >
> > --
> > Regards,
> > Anand
> >
>
>
>
> --
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> Solr
>



-- 
Regards,
Anand

RE: Rowkey hashing to avoid hotspotting

Posted by Cristofer Weber <cr...@neogrid.com>.
So, Anand, there are some things that can help, but again, most of them are related to the famous access patterns.

Sometimes it is not easy to get more information about them in advance, but if you are replacing another system you can study its data distribution: grouping for counts, means, changes over time, etc. It is possible to analyze partial data too, but it is risky because you depend on the way this partial data was gathered; sample data may not be representative.

Salting your rowkey with a hash calculated over your model# will probably result in a uniform distribution over a range (if using modulus), and pre-splitting your table will balance your load over your Region Servers. Also, you will be able to recalculate the hash for a model# before scanning for it, which allows a scan over specific rowkeys restricted by startRow and stopRow. Remember that rowkeys sharing the same prefix will probably be located in the same region, and your scan will benefit from this.
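As a minimal sketch of the salting idea above (this is not code from the thread): the class name SaltedRowKey, the '|' separator, the bucket count of 16 and the use of String.hashCode() are all assumptions made for illustration. The salt is computed from the model# only, so it can be recomputed before a scan and used to build startRow and stopRow.

```java
// Hypothetical helper; key layout, separator and bucket count are assumptions.
final class SaltedRowKey {
    static final int BUCKETS = 16;   // assumed number of salt buckets / pre-split regions
    static final char SEP = '|';     // assumed separator, must not occur in model# or serial#

    // Salt derived only from the model#, so readers can recompute it.
    static String salt(String model) {
        int bucket = Math.floorMod(model.hashCode(), BUCKETS);
        return String.format("%03x", bucket);
    }

    // Full rowkey: salt, then model#, then serial#.
    static String rowKey(String model, String serial) {
        return salt(model) + SEP + model + SEP + serial;
    }

    // Inclusive start row for scanning every row of one model#.
    static String startRow(String model) {
        return salt(model) + SEP + model + SEP;
    }

    // Exclusive stop row: the start row with its last character incremented.
    static String stopRow(String model) {
        String start = startRow(model);
        char last = start.charAt(start.length() - 1);
        return start.substring(0, start.length() - 1) + (char) (last + 1);
    }
}
```

With a real HBase client the resulting strings would be converted to bytes and used as the scan's start and stop rows. A timestamp or another suffix could be appended after the serial# without changing the range logic.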

I'm still curious about your need to add a timestamp after your model#,serial#... I have some background in manufacturing systems and usually a serial number is unique. But, of course, it's just curiosity.  :-)

Regards,
Cristofer

-----Original Message-----
From: Alex Baranau [mailto:alex.baranov.v@gmail.com]
Sent: Tuesday, July 17, 2012 12:53
To: user@hbase.apache.org
Subject: Re: Rowkey hashing to avoid hotspotting

The most common reason for RS hotspotting during writing data in HBase is writing rows with monotonically increasing/decreasing row keys. E.g. if you put timestamp in the first part of your key, then you are likely to have monotonically increasing row keys. You can find more info about this issue and how to solve it here: [1] and also you may want to look at already implemented salting solution [2].

As for RS hotspotting during reading - it is hard to predict without knowing what the most common data access patterns are. E.g. putting model # in the first part of a key may seem like a good distribution, but if your web site is used mostly by Mercedes owners, the majority of the read load may be directed to just a few regions. Again, salting can help a lot here.

+1 to what Cristofer said on other things, esp: use partial key scans where possible instead of filters and pre-split your table.

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr

[1] http://bit.ly/HnKjbc
[2] https://github.com/sematext/HBaseWD

On Tue, Jul 17, 2012 at 10:44 AM, AnandaVelMurugan Chandra Mohan < ananthu2050@gmail.com> wrote:

> Hi Cristofer,
>
> Thanks for elaborate response!!!
>
> I have no much information about production data as I work with 
> partial data. But based on discussion with my project partners, I have 
> some answers for you.
>
> Number of model numbers and serial numbers will be finite. Not so many...
> As far as I know,there is no predefined rule for model number or 
> serial number creation.
>
> I have two access pattern. I count the number of rows for a specific 
> model number. I use rowkey filter for this. Also I filter the rows 
> based on model, serial number and some other columns. I scan the table 
> with column value filter for this case.
>
> I will evaluate salting as you have explained.
>
> Regards,
> Anand.C
>
> On Tue, Jul 17, 2012 at 12:30 AM, Cristofer Weber < 
> cristofer.weber@neogrid.com> wrote:
>
> > Hi Anand,
> >
> > As usual, the answer is that 'it depends'  :)
> >
> > I think that the main question here is: why are you afraid that this
> setup
> > would lead to region server hotspotting? Is because you don't know 
> > how
> your
> > production data will seems?
> >
> > Based on what you told about your rowkey, you will query mostly by 
> > providing model no. + serial no., but:
> > 1 - How is your rowkey distribution? There are tons of different 
> > modelNumbers AND serialNumbers? Few modelNumbers and a lot of 
> > serialNumbers? Few of both?
> > 2 - Putting modelNumber in front of your rowkey means that your data 
> > will be sorted by rowkey. So, what is the rule that determinates a 
> > modelNumber creation? Is it a sequential number that will be 
> > increased by time? If
> so,
> > are newer members accessed a lot more than older members? If not, 
> > what
> will
> > drive this number? Is it an encoding rule?
> > 3 - Do you expect more write/read load over a few of these 
> > modelNumbers and/or serialNumbers? Will it be similar to a Pareto Distribution?
> > Distributed over what?
> >
> > Also, two other things got my attention here...
> > 1 - Why are you filtering with regex? If your queries are over model no.
> +
> > serial no., why don't you just scan starting by your
> > modelNumber+SerialNumber, and stoping on your next
> > modelNumber+SerialNumber? Or is there another access pattern that
> > doesn't apply to your composited rowkey?
> > 2 - Why do you have to add a timestamp to ensure uniqueness?
> >
> > Now, answering your question without more info about your data, you 
> > can apply hash in two ways:
> > 1 - Generating a hash (MD5 is the most common as far as I read 
> > about) and using only this hash as your rowkey. Based on what you 
> > have told, this
> way
> > doesn't fit your needs, because you would not be able to do apply 
> > your filter anymore.
> > 2 - Salting, by prefixing your current rowkey with a pinch of hash.
> Notice
> > that the hash portion must be your rowkey prefix to ensure a kind of 
> > balanced distribution over something (where something is your region 
> > servers). I'm working with a case that is a bit similar to yours, 
> > and
> what
> > I'm doing right now is calculating the hashValue of my rowkey and 
> > using a Java Formatter to create a hex string to prepend to my 
> > rowkey. Something like a String.format("%03x", hashValue)
> >
> > In both cases, you still have to split your regions in advance, and 
> > it will be better to work your splitting before starting to feed 
> > your table with production data.
> >
> > Also, you have to study the consequences that changing your rowkey 
> > will bring. It's not for free.
> >
> > There's a lot of words here and a lot of questions, so by now I feel 
> > I started to shoot in the dark. Try to understand your production 
> > data and
> if
> > you have more to share, for sure it will help!
> >
> > Regards,
> > Cristofer
> >
> > -----Original Message-----
> > From: AnandaVelMurugan Chandra Mohan [mailto:ananthu2050@gmail.com]
> > Sent: Monday, July 16, 2012 02:30
> > To: user@hbase.apache.org
> > Subject: Rowkey hashing to avoid hotspotting
> >
> > Hi,
> >
> > I am using Hbase to store data about mechanical components. Each
> component
> > has model no. and serial no. and some other attributes.
> >
> > I would be querying my data mostly by model no. and serial no. So I 
> > created a composite key with these two attributes and added 
> > timestamp to make it unique.
> >
> > To filter the data, I use rowkey filter with regex string comparator 
> > and it works well with sample seed data. Now I am afraid whether 
> > this set up will lead to region server hotspotting when we load 
> > production data in HBase. I read hashing may solve this problem. Can 
> > some one help me in implementing hashing the row key? Also I would 
> > want the row filter to
> work
> > as I have to display the number of components in a web page and I 
> > use row key filter for implementing that functionality? Any guidance 
> > would be of great help.
> >
> > --
> > Regards,
> > Anand
> >
>
>
>
> --
> Regards,
> Anand
>



--
Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr

Re: Rowkey hashing to avoid hotspotting

Posted by Alex Baranau <al...@gmail.com>.
You might be right: when read load is concentrated on a single RS (or a
few), they will not act as dead as when hotspotting occurs during writing.
I think I was referring more to "uneven read load distribution" when I
called it hotspotting while reading.

Caches will help for sure, but that might not be enough. Having one or
several RSs sweating more than the others in a cluster is already not a
very desirable situation. It may also be that it is not a specific set of
records within the regions on an RS (read as "data blocks") that is under
load, but whole regions that for some reason hold more hot data (as in the
example above: with keys prefixed with model, several whole regions
containing data of the same model may be frequently accessed). In this
case HBase (depending on hardware) may not be able to fit all that data in
cache on the hot RS (or few RSs), as opposed to the situation where the
hot data is distributed over many more RSs (which then act like a
distributed cache), e.g. with salting.
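
One trade-off of spreading hot data this way: if the salt is derived from
the whole rowkey (rather than from the model alone), rows of one model end
up in every bucket, so a read for that model becomes one scan per bucket.
A minimal sketch of building those per-bucket ranges follows; the class
name BucketFanOut, the "%03x" salt width and the '|' separator in the
usage are assumptions for illustration, not the HBaseWD API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: when the salt depends on the whole rowkey, one logical
// key prefix is present in every bucket and has to be scanned once per bucket.
final class BucketFanOut {
    // Returns {startRow, stopRow} pairs, one per salt bucket.
    static List<String[]> ranges(String keyPrefix, int buckets) {
        List<String[]> out = new ArrayList<>();
        for (int b = 0; b < buckets; b++) {
            String start = String.format("%03x", b) + keyPrefix;
            // Exclusive stop row: bump the last character of the start row.
            char last = start.charAt(start.length() - 1);
            String stop = start.substring(0, start.length() - 1) + (char) (last + 1);
            out.add(new String[] {start, stop});
        }
        return out;
    }
}
```

For example, BucketFanOut.ranges("M100|", 16) yields 16 ranges that could
be scanned in parallel and merged on the client.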

In general, yes, you will not see as big issues with uneven *read* load
distribution over the cluster as you might see in case of uneven *write*
load distribution.

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
Solr

On Tue, Jul 17, 2012 at 12:44 PM, Michel Segel <mi...@hotmail.com> wrote:

> Reading hot spotting?
> Hmmm there's a cache and I don't see any real use cases where you would
> have it occur naturally.
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On Jul 17, 2012, at 10:53 AM, Alex Baranau <al...@gmail.com>
> wrote:
>
> > The most common reason for RS hotspotting during writing data in HBase is
> > writing rows with monotonically increasing/decreasing row keys. E.g. if
> you
> > put timestamp in the first part of your key, then you are likely to have
> > monotonically increasing row keys. You can find more info about this
> issue
> > and how to solve it here: [1] and also you may want to look at already
> > implemented salting solution [2].
> >
> > As for RS hotspotting during reading - it is hard to predict without
> > knowing what it the most common data access patterns. E.g. putting model
> #
> > in first part of a key may seem like a good distribution, but if your web
> > site used mostly by Mercedes owners, the majority of the read load may be
> > directed to just few regions. Again, salting can help a lot here.
> >
> > +1 to what Cristofer said on other things, esp: use partial key scans
> were
> > possible instead of filters and pre-split your table.
> >
> > Alex Baranau
> > ------
> > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch
> -
> > Solr
> >
> > [1] http://bit.ly/HnKjbc
> > [2] https://github.com/sematext/HBaseWD
> >
> > On Tue, Jul 17, 2012 at 10:44 AM, AnandaVelMurugan Chandra Mohan <
> > ananthu2050@gmail.com> wrote:
> >
> >> Hi Cristofer,
> >>
> >> Thanks for elaborate response!!!
> >>
> >> I have no much information about production data as I work with partial
> >> data. But based on discussion with my project partners, I have some
> answers
> >> for you.
> >>
> >> Number of model numbers and serial numbers will be finite. Not so
> many...
> >> As far as I know,there is no predefined rule for model number or serial
> >> number creation.
> >>
> >> I have two access pattern. I count the number of rows for a specific
> model
> >> number. I use rowkey filter for this. Also I filter the rows based on
> >> model, serial number and some other columns. I scan the table with
> column
> >> value filter for this case.
> >>
> >> I will evaluate salting as you have explained.
> >>
> >> Regards,
> >> Anand.C
> >>
> >> On Tue, Jul 17, 2012 at 12:30 AM, Cristofer Weber <
> >> cristofer.weber@neogrid.com> wrote:
> >>
> >>> Hi Anand,
> >>>
> >>> As usual, the answer is that 'it depends'  :)
> >>>
> >>> I think that the main question here is: why are you afraid that this
> >> setup
> >>> would lead to region server hotspotting? Is because you don't know how
> >> your
> >>> production data will seems?
> >>>
> >>> Based on what you told about your rowkey, you will query mostly by
> >>> providing model no. + serial no., but:
> >>> 1 - How is your rowkey distribution? There are tons of different
> >>> modelNumbers AND serialNumbers? Few modelNumbers and a lot of
> >>> serialNumbers? Few of both?
> >>> 2 - Putting modelNumber in front of your rowkey means that your data
> will
> >>> be sorted by rowkey. So, what is the rule that determinates a
> modelNumber
> >>> creation? Is it a sequential number that will be increased by time? If
> >> so,
> >>> are newer members accessed a lot more than older members? If not, what
> >> will
> >>> drive this number? Is it an encoding rule?
> >>> 3 - Do you expect more write/read load over a few of these modelNumbers
> >>> and/or serialNumbers? Will it be similar to a Pareto Distribution?
> >>> Distributed over what?
> >>>
> >>> Also, two other things got my attention here...
> >>> 1 - Why are you filtering with regex? If your queries are over model
> no.
> >> +
> >>> serial no., why don't you just scan starting by your
> >>> modelNumber+SerialNumber, and stoping on your next
> >>> modelNumber+SerialNumber? Or is there another access pattern that
> doesn't
> >>> apply to your composited rowkey?
> >>> 2 - Why do you have to add a timestamp to ensure uniqueness?
> >>>
> >>> Now, answering your question without more info about your data, you can
> >>> apply hash in two ways:
> >>> 1 - Generating a hash (MD5 is the most common as far as I read about)
> and
> >>> using only this hash as your rowkey. Based on what you have told, this
> >> way
> >>> doesn't fit your needs, because you would not be able to do apply your
> >>> filter anymore.
> >>> 2 - Salting, by prefixing your current rowkey with a pinch of hash.
> >> Notice
> >>> that the hash portion must be your rowkey prefix to ensure a kind of
> >>> balanced distribution over something (where something is your region
> >>> servers). I'm working with a case that is a bit similar to yours, and
> >> what
> >>> I'm doing right now is calculating the hashValue of my rowkey and
> using a
> >>> Java Formatter to create a hex string to prepend to my rowkey.
> Something
> >>> like a String.format("%03x", hashValue)
> >>>
> >>> In both cases, you still have to split your regions in advance, and it
> >>> will be better to work your splitting before starting to feed your
> table
> >>> with production data.
> >>>
> >>> Also, you have to study the consequences that changing your rowkey will
> >>> bring. It's not for free.
> >>>
> >>> There's a lot of words here and a lot of questions, so by now I feel I
> >>> started to shoot in the dark. Try to understand your production data
> and
> >> if
> >>> you have more to share, for sure it will help!
> >>>
> >>> Regards,
> >>> Cristofer
> >>>
> >>> -----Original Message-----
> >>> From: AnandaVelMurugan Chandra Mohan [mailto:ananthu2050@gmail.com]
> >>> Sent: Monday, July 16, 2012 02:30
> >>> To: user@hbase.apache.org
> >>> Subject: Rowkey hashing to avoid hotspotting
> >>>
> >>> Hi,
> >>>
> >>> I am using Hbase to store data about mechanical components. Each
> >> component
> >>> has model no. and serial no. and some other attributes.
> >>>
> >>> I would be querying my data mostly by model no. and serial no. So I
> >>> created a composite key with these two attributes and added timestamp
> to
> >>> make it unique.
> >>>
> >>> To filter the data, I use rowkey filter with regex string comparator
> and
> >>> it works well with sample seed data. Now I am afraid whether this set
> up
> >>> will lead to region server hotspotting when we load production data in
> >>> HBase. I read hashing may solve this problem. Can some one help me in
> >>> implementing hashing the row key? Also I would want the row filter to
> >> work
> >>> as I have to display the number of components in a web page and I use
> row
> >>> key filter for implementing that functionality? Any guidance would be
> of
> >>> great help.
> >>>
> >>> --
> >>> Regards,
> >>> Anand
> >>>
> >>
> >>
> >>
> >> --
> >> Regards,
> >> Anand
> >>
> >
> >
> >
> > --
> > Alex Baranau
> > ------
> > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch
> -
> > Solr
>



-- 
Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
Solr

Re: Rowkey hashing to avoid hotspotting

Posted by Michel Segel <mi...@hotmail.com>.
Reading hot spotting?
Hmmm there's a cache and I don't see any real use cases where you would have it occur naturally.

Sent from a remote device. Please excuse any typos...

Mike Segel

On Jul 17, 2012, at 10:53 AM, Alex Baranau <al...@gmail.com> wrote:

> The most common reason for RS hotspotting during writing data in HBase is
> writing rows with monotonically increasing/decreasing row keys. E.g. if you
> put timestamp in the first part of your key, then you are likely to have
> monotonically increasing row keys. You can find more info about this issue
> and how to solve it here: [1] and also you may want to look at already
> implemented salting solution [2].
> 
> As for RS hotspotting during reading - it is hard to predict without
> knowing what it the most common data access patterns. E.g. putting model #
> in first part of a key may seem like a good distribution, but if your web
> site used mostly by Mercedes owners, the majority of the read load may be
> directed to just few regions. Again, salting can help a lot here.
> 
> +1 to what Cristofer said on other things, esp: use partial key scans were
> possible instead of filters and pre-split your table.
> 
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> Solr
> 
> [1] http://bit.ly/HnKjbc
> [2] https://github.com/sematext/HBaseWD
> 
> On Tue, Jul 17, 2012 at 10:44 AM, AnandaVelMurugan Chandra Mohan <
> ananthu2050@gmail.com> wrote:
> 
>> Hi Cristofer,
>> 
>> Thanks for elaborate response!!!
>> 
>> I have no much information about production data as I work with partial
>> data. But based on discussion with my project partners, I have some answers
>> for you.
>> 
>> Number of model numbers and serial numbers will be finite. Not so many...
>> As far as I know,there is no predefined rule for model number or serial
>> number creation.
>> 
>> I have two access pattern. I count the number of rows for a specific model
>> number. I use rowkey filter for this. Also I filter the rows based on
>> model, serial number and some other columns. I scan the table with column
>> value filter for this case.
>> 
>> I will evaluate salting as you have explained.
>> 
>> Regards,
>> Anand.C
>> 
>> On Tue, Jul 17, 2012 at 12:30 AM, Cristofer Weber <
>> cristofer.weber@neogrid.com> wrote:
>> 
>>> Hi Anand,
>>> 
>>> As usual, the answer is that 'it depends'  :)
>>> 
>>> I think that the main question here is: why are you afraid that this
>> setup
>>> would lead to region server hotspotting? Is because you don't know how
>> your
>>> production data will seems?
>>> 
>>> Based on what you told about your rowkey, you will query mostly by
>>> providing model no. + serial no., but:
>>> 1 - How is your rowkey distribution? There are tons of different
>>> modelNumbers AND serialNumbers? Few modelNumbers and a lot of
>>> serialNumbers? Few of both?
>>> 2 - Putting modelNumber in front of your rowkey means that your data will
>>> be sorted by rowkey. So, what is the rule that determinates a modelNumber
>>> creation? Is it a sequential number that will be increased by time? If
>> so,
>>> are newer members accessed a lot more than older members? If not, what
>> will
>>> drive this number? Is it an encoding rule?
>>> 3 - Do you expect more write/read load over a few of these modelNumbers
>>> and/or serialNumbers? Will it be similar to a Pareto Distribution?
>>> Distributed over what?
>>> 
>>> Also, two other things got my attention here...
>>> 1 - Why are you filtering with regex? If your queries are over model no.
>> +
>>> serial no., why don't you just scan starting by your
>>> modelNumber+SerialNumber, and stoping on your next
>>> modelNumber+SerialNumber? Or is there another access pattern that doesn't
>>> apply to your composited rowkey?
>>> 2 - Why do you have to add a timestamp to ensure uniqueness?
>>> 
>>> Now, answering your question without more info about your data, you can
>>> apply hash in two ways:
>>> 1 - Generating a hash (MD5 is the most common as far as I read about) and
>>> using only this hash as your rowkey. Based on what you have told, this
>> way
>>> doesn't fit your needs, because you would not be able to do apply your
>>> filter anymore.
>>> 2 - Salting, by prefixing your current rowkey with a pinch of hash.
>> Notice
>>> that the hash portion must be your rowkey prefix to ensure a kind of
>>> balanced distribution over something (where something is your region
>>> servers). I'm working with a case that is a bit similar to yours, and
>> what
>>> I'm doing right now is calculating the hashValue of my rowkey and using a
>>> Java Formatter to create a hex string to prepend to my rowkey. Something
>>> like a String.format("%03x", hashValue)
>>> 
>>> In both cases, you still have to split your regions in advance, and it
>>> will be better to work your splitting before starting to feed your table
>>> with production data.
>>> 
>>> Also, you have to study the consequences that changing your rowkey will
>>> bring. It's not for free.
>>> 
>>> There's a lot of words here and a lot of questions, so by now I feel I
>>> started to shoot in the dark. Try to understand your production data and
>> if
>>> you have more to share, for sure it will help!
>>> 
>>> Regards,
>>> Cristofer
>>> 
>>> -----Original Message-----
>>> From: AnandaVelMurugan Chandra Mohan [mailto:ananthu2050@gmail.com]
>>> Sent: Monday, July 16, 2012 02:30
>>> To: user@hbase.apache.org
>>> Subject: Rowkey hashing to avoid hotspotting
>>> 
>>> Hi,
>>> 
>>> I am using Hbase to store data about mechanical components. Each
>> component
>>> has model no. and serial no. and some other attributes.
>>> 
>>> I would be querying my data mostly by model no. and serial no. So I
>>> created a composite key with these two attributes and added timestamp to
>>> make it unique.
>>> 
>>> To filter the data, I use rowkey filter with regex string comparator and
>>> it works well with sample seed data. Now I am afraid whether this set up
>>> will lead to region server hotspotting when we load production data in
>>> HBase. I read hashing may solve this problem. Can some one help me in
>>> implementing hashing the row key? Also I would want the row filter to
>> work
>>> as I have to display the number of components in a web page and I use row
>>> key filter for implementing that functionality? Any guidance would be of
>>> great help.
>>> 
>>> --
>>> Regards,
>>> Anand
>>> 
>> 
>> 
>> 
>> --
>> Regards,
>> Anand
>> 
> 
> 
> 
> -- 
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> Solr

Re: Rowkey hashing to avoid hotspotting

Posted by Alex Baranau <al...@gmail.com>.
The most common reason for RS hotspotting while writing data to HBase is
writing rows with monotonically increasing/decreasing row keys. E.g. if you
put a timestamp in the first part of your key, then you are likely to have
monotonically increasing row keys. You can find more info about this issue
and how to solve it here: [1], and you may also want to look at an already
implemented salting solution [2].

As for RS hotspotting during reading - it is hard to predict without
knowing what the most common data access patterns are. E.g. putting model #
in the first part of a key may seem like a good distribution, but if your
web site is used mostly by Mercedes owners, the majority of the read load
may be directed to just a few regions. Again, salting can help a lot here.

+1 to what Cristofer said on other things, esp: use partial key scans where
possible instead of filters and pre-split your table.
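
To make the pre-splitting advice concrete, here is a minimal sketch that
generates one split point per salt bucket boundary. It assumes the
three-digit hex salt prefix ("%03x") mentioned earlier in the thread; the
class name SaltSplitPoints and the bucket count are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: split points for a table whose rowkeys start with a
// zero-padded, three-digit hex salt ("000", "001", ...).
final class SaltSplitPoints {
    // n buckets need n - 1 split points; the first region starts at the empty key.
    static List<String> splitPoints(int buckets) {
        List<String> points = new ArrayList<>();
        for (int b = 1; b < buckets; b++) {
            points.add(String.format("%03x", b));
        }
        return points;
    }
}
```

The resulting strings, converted to bytes, are the kind of split keys that
can be supplied when the table is created (I believe the HBase admin API
has a createTable variant accepting a byte[][] of split keys), so the
regions exist before production data arrives.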

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
Solr

[1] http://bit.ly/HnKjbc
[2] https://github.com/sematext/HBaseWD

On Tue, Jul 17, 2012 at 10:44 AM, AnandaVelMurugan Chandra Mohan <
ananthu2050@gmail.com> wrote:

> Hi Cristofer,
>
> Thanks for elaborate response!!!
>
> I have no much information about production data as I work with partial
> data. But based on discussion with my project partners, I have some answers
> for you.
>
> Number of model numbers and serial numbers will be finite. Not so many...
> As far as I know,there is no predefined rule for model number or serial
> number creation.
>
> I have two access pattern. I count the number of rows for a specific model
> number. I use rowkey filter for this. Also I filter the rows based on
> model, serial number and some other columns. I scan the table with column
> value filter for this case.
>
> I will evaluate salting as you have explained.
>
> Regards,
> Anand.C
>
> On Tue, Jul 17, 2012 at 12:30 AM, Cristofer Weber <
> cristofer.weber@neogrid.com> wrote:
>
> > Hi Anand,
> >
> > As usual, the answer is that 'it depends'  :)
> >
> > I think that the main question here is: why are you afraid that this
> setup
> > would lead to region server hotspotting? Is because you don't know how
> your
> > production data will seems?
> >
> > Based on what you told about your rowkey, you will query mostly by
> > providing model no. + serial no., but:
> > 1 - How is your rowkey distribution? There are tons of different
> > modelNumbers AND serialNumbers? Few modelNumbers and a lot of
> > serialNumbers? Few of both?
> > 2 - Putting modelNumber in front of your rowkey means that your data will
> > be sorted by rowkey. So, what is the rule that determinates a modelNumber
> > creation? Is it a sequential number that will be increased by time? If
> so,
> > are newer members accessed a lot more than older members? If not, what
> will
> > drive this number? Is it an encoding rule?
> > 3 - Do you expect more write/read load over a few of these modelNumbers
> > and/or serialNumbers? Will it be similar to a Pareto Distribution?
> > Distributed over what?
> >
> > Also, two other things got my attention here...
> > 1 - Why are you filtering with regex? If your queries are over model no.
> +
> > serial no., why don't you just scan starting by your
> > modelNumber+SerialNumber, and stoping on your next
> > modelNumber+SerialNumber? Or is there another access pattern that doesn't
> > apply to your composited rowkey?
> > 2 - Why do you have to add a timestamp to ensure uniqueness?
> >
> > Now, answering your question without more info about your data, you can
> > apply hash in two ways:
> > 1 - Generating a hash (MD5 is the most common as far as I read about) and
> > using only this hash as your rowkey. Based on what you have told, this
> way
> > doesn't fit your needs, because you would not be able to do apply your
> > filter anymore.
> > 2 - Salting, by prefixing your current rowkey with a pinch of hash.
> Notice
> > that the hash portion must be your rowkey prefix to ensure a kind of
> > balanced distribution over something (where something is your region
> > servers). I'm working with a case that is a bit similar to yours, and
> what
> > I'm doing right now is calculating the hashValue of my rowkey and using a
> > Java Formatter to create a hex string to prepend to my rowkey. Something
> > like a String.format("%03x", hashValue)
> >
> > In both cases, you still have to split your regions in advance, and it
> > will be better to work your splitting before starting to feed your table
> > with production data.
> >
> > Also, you have to study the consequences that changing your rowkey will
> > bring. It's not for free.
> >
> > There's a lot of words here and a lot of questions, so by now I feel I
> > started to shoot in the dark. Try to understand your production data and
> if
> > you have more to share, for sure it will help!
> >
> > Regards,
> > Cristofer
> >
> > -----Original Message-----
> > From: AnandaVelMurugan Chandra Mohan [mailto:ananthu2050@gmail.com]
> > Sent: Monday, July 16, 2012 02:30
> > To: user@hbase.apache.org
> > Subject: Rowkey hashing to avoid hotspotting
> >
> > Hi,
> >
> > I am using Hbase to store data about mechanical components. Each
> component
> > has model no. and serial no. and some other attributes.
> >
> > I would be querying my data mostly by model no. and serial no. So I
> > created a composite key with these two attributes and added timestamp to
> > make it unique.
> >
> > To filter the data, I use rowkey filter with regex string comparator and
> > it works well with sample seed data. Now I am afraid whether this set up
> > will lead to region server hotspotting when we load production data in
> > HBase. I read hashing may solve this problem. Can some one help me in
> > implementing hashing the row key? Also I would want the row filter to
> work
> > as I have to display the number of components in a web page and I use row
> > key filter for implementing that functionality? Any guidance would be of
> > great help.
> >
> > --
> > Regards,
> > Anand
> >
>
>
>
> --
> Regards,
> Anand
>



-- 
Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
Solr

Re: Rowkey hashing to avoid hotspotting

Posted by AnandaVelMurugan Chandra Mohan <an...@gmail.com>.
Hi Cristofer,

Thanks for the elaborate response!

I do not have much information about production data, as I work with
partial data. But based on discussions with my project partners, I have
some answers for you.

The number of model numbers and serial numbers will be finite. Not so
many... As far as I know, there is no predefined rule for model number or
serial number creation.

I have two access patterns. First, I count the number of rows for a
specific model number; I use a rowkey filter for this. Second, I filter
the rows based on model, serial number and some other columns; I scan the
table with a column value filter for this case.

I will evaluate salting as you have explained.

Regards,
Anand.C

On Tue, Jul 17, 2012 at 12:30 AM, Cristofer Weber <
cristofer.weber@neogrid.com> wrote:

> Hi Anand,
>
> As usual, the answer is that 'it depends'  :)
>
> I think that the main question here is: why are you afraid that this setup
> would lead to region server hotspotting? Is it because you don't know what
> your production data will look like?
>
> Based on what you said about your rowkey, you will query mostly by
> providing model no. + serial no., but:
> 1 - How are your rowkeys distributed? Are there tons of different
> modelNumbers AND serialNumbers? Few modelNumbers and a lot of
> serialNumbers? Few of both?
> 2 - Putting modelNumber in front of your rowkey means that your data will
> be sorted by modelNumber first. So, what rule determines how a modelNumber
> is created? Is it a sequential number that increases over time? If so,
> are newer model numbers accessed a lot more than older ones? If not, what
> drives this number? Is it an encoding rule?
> 3 - Do you expect more write/read load on a few of these modelNumbers
> and/or serialNumbers? Will it resemble a Pareto distribution?
> Distributed over what?
>
> Also, two other things caught my attention here...
> 1 - Why are you filtering with regex? If your queries are over model no. +
> serial no., why don't you just scan starting at your
> modelNumber+serialNumber and stopping at the next
> modelNumber+serialNumber? Or is there another access pattern that your
> composite rowkey doesn't serve?
> 2 - Why do you have to add a timestamp to ensure uniqueness?
>
> Now, answering your question without more info about your data, you can
> apply hashing in two ways:
> 1 - Generating a hash (MD5 is the most common, as far as I have read) and
> using only this hash as your rowkey. Based on what you have said, this
> doesn't fit your needs, because you would no longer be able to apply your
> filter.
> 2 - Salting, by prefixing your current rowkey with a pinch of hash. Notice
> that the hash portion must be the prefix of your rowkey to ensure a
> reasonably balanced distribution across your region servers. I'm working
> on a case that is a bit similar to yours, and what I'm doing right now is
> calculating the hashValue of my rowkey and using a Java Formatter to
> create a hex string to prepend to my rowkey. Something like
> String.format("%03x", hashValue)
>
> In both cases, you still have to split your regions in advance, and it is
> better to work out your splits before you start feeding your table with
> production data.
>
> Also, you have to study the consequences of changing your rowkey. It
> doesn't come for free.
>
> There are a lot of words and a lot of questions here, and by now I feel I
> have started shooting in the dark. Try to understand your production data,
> and if you have more to share, it will certainly help!
>
> Regards,
> Cristofer
>
> -----Original Message-----
> From: AnandaVelMurugan Chandra Mohan [mailto:ananthu2050@gmail.com]
> Sent: Monday, July 16, 2012 02:30
> To: user@hbase.apache.org
> Subject: Rowkey hashing to avoid hotspotting
>
> Hi,
>
> I am using Hbase to store data about mechanical components. Each component
> has model no. and serial no. and some other attributes.
>
> I would be querying my data mostly by model no. and serial no. So I
> created a composite key with these two attributes and added timestamp to
> make it unique.
>
> To filter the data, I use rowkey filter with regex string comparator and
> it works well with sample seed data. Now I am afraid whether this set up
> will lead to region server hotspotting when we load production data in
> HBase. I read hashing may solve this problem. Can some one help me in
> implementing hashing the row key? Also I would want the row filter to work
> as I have to display the number of components in a web page and I use row
> key filter for implementing that functionality? Any guidance would be of
> great help.
>
> --
> Regards,
> Anand
>



-- 
Regards,
Anand

Re: Rowkey hashing to avoid hotspotting

Posted by Cristofer Weber <cr...@neogrid.com>.
Hi Anand,

As usual, the answer is that 'it depends'  :)

I think that the main question here is: why are you afraid that this setup would lead to region server hotspotting? Is it because you don't know what your production data will look like?

Based on what you said about your rowkey, you will query mostly by providing model no. + serial no., but:
1 - How are your rowkeys distributed? Are there tons of different modelNumbers AND serialNumbers? Few modelNumbers and a lot of serialNumbers? Few of both?
2 - Putting modelNumber in front of your rowkey means that your data will be sorted by modelNumber first. So, what rule determines how a modelNumber is created? Is it a sequential number that increases over time? If so, are newer model numbers accessed a lot more than older ones? If not, what drives this number? Is it an encoding rule?
3 - Do you expect more write/read load on a few of these modelNumbers and/or serialNumbers? Will it resemble a Pareto distribution? Distributed over what?

Also, two other things caught my attention here...
1 - Why are you filtering with regex? If your queries are over model no. + serial no., why don't you just scan starting at your modelNumber+serialNumber and stopping at the next modelNumber+serialNumber? Or is there another access pattern that your composite rowkey doesn't serve?
2 - Why do you have to add a timestamp to ensure uniqueness?
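
To make the start/stop scan from point 1 concrete, here is a minimal sketch in plain Java of how the two bounds could be computed. The key layout model|serial|timestamp and the '|' delimiter are assumptions for illustration only, nothing in this thread confirms them. The two byte arrays would then be handed to an HBase Scan as the start row (inclusive) and stop row (exclusive).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class PrefixScanBounds {
    /**
     * Returns the smallest row key that sorts after every key starting with
     * the given prefix. An empty array means "no upper bound".
     */
    static byte[] stopRowForPrefix(byte[] prefix) {
        int end = prefix.length;
        // Trailing 0xFF bytes cannot be incremented, so drop them first.
        while (end > 0 && prefix[end - 1] == (byte) 0xFF) {
            end--;
        }
        if (end == 0) {
            return new byte[0];
        }
        byte[] stop = Arrays.copyOf(prefix, end);
        stop[end - 1]++; // bump the last remaining byte by one
        return stop;
    }

    public static void main(String[] args) {
        // Hypothetical key layout: model|serial|timestamp
        byte[] start = "M100|S42|".getBytes(StandardCharsets.UTF_8);
        byte[] stop = stopRowForPrefix(start);
        System.out.println(new String(start, StandardCharsets.UTF_8)
                + " .. " + new String(stop, StandardCharsets.UTF_8));
        // prints: M100|S42| .. M100|S42}
    }
}
```

Keeping the trailing delimiter in the prefix matters: without it, a scan for serial S42 would also pick up S420, S421 and so on.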

Now, answering your question without more info about your data, you can apply hashing in two ways:
1 - Generating a hash (MD5 is the most common, as far as I have read) and using only this hash as your rowkey. Based on what you have said, this doesn't fit your needs, because you would no longer be able to apply your filter.
2 - Salting, by prefixing your current rowkey with a pinch of hash. Notice that the hash portion must be the prefix of your rowkey to ensure a reasonably balanced distribution across your region servers. I'm working on a case that is a bit similar to yours, and what I'm doing right now is calculating the hashValue of my rowkey and using a Java Formatter to create a hex string to prepend to my rowkey. Something like String.format("%03x", hashValue)
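
A rough sketch of the salting idea in option 2, using only the JDK. The bucket count of 16, the use of String.hashCode() as the hash, and the sample key are all assumptions chosen for illustration; any stable hash would do.

```java
final class SaltedRowKey {
    // Hypothetical bucket count; in practice it would be chosen together
    // with the number of pre-split regions. Must stay <= 4096 so the
    // prefix always fits in three hex digits.
    static final int BUCKETS = 16;

    /** Prepends a three-digit hex bucket derived from the key's hash. */
    static String salt(String rowKey, int buckets) {
        // floorMod keeps the bucket non-negative even if hashCode() is negative.
        int bucket = Math.floorMod(rowKey.hashCode(), buckets);
        return String.format("%03x", bucket) + rowKey;
    }

    public static void main(String[] args) {
        // Hypothetical composite key: model|serial|timestamp
        String key = "M100|S42|1342423811000";
        System.out.println(salt(key, BUCKETS));
    }
}
```

Note the trade-off in what gets hashed. Hashing the whole key, as here, spreads writes most evenly, but a reader can no longer recompute the prefix from model and serial number alone. Hashing only the model+serial part would let a reader recompute the prefix, at the price of keeping all rows of one component in a single bucket.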

In both cases, you still have to split your regions in advance, and it is better to work out your splits before you start feeding your table with production data.
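
For a three-digit hex salt prefix, the split points can be derived directly from the bucket numbers. The sketch below assumes 16 buckets (a made-up figure) and only builds the byte[][]; as far as I know, HBaseAdmin.createTable has an overload that takes such an array of split keys, but check that against your HBase version.

```java
import java.nio.charset.StandardCharsets;

final class SaltSplitPoints {
    /**
     * Builds one split key per bucket boundary, so that each bucket
     * ("000", "001", ...) ends up in its own region.
     * buckets - 1 split keys produce buckets regions.
     */
    static byte[][] splitKeys(int buckets) {
        byte[][] keys = new byte[buckets - 1][];
        for (int b = 1; b < buckets; b++) {
            // Same "%03x" format as the salt prefix itself.
            keys[b - 1] = String.format("%03x", b).getBytes(StandardCharsets.UTF_8);
        }
        return keys;
    }

    public static void main(String[] args) {
        for (byte[] key : splitKeys(16)) {
            System.out.println(new String(key, StandardCharsets.UTF_8));
        }
    }
}
```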

Also, you have to study the consequences of changing your rowkey. It doesn't come for free.
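
One such consequence, sketched under assumptions: if the salt is derived from the full key, the rows of one model number are scattered over all buckets, so counting them is no longer a single contiguous scan but one scan per bucket. The helper below (hypothetical names, plain JDK, '|' delimiter assumed) lists the start/stop pairs a client would have to scan and then merge.

```java
import java.util.ArrayList;
import java.util.List;

final class SaltedPrefixScans {
    /** One {startRow, stopRow} pair per salt bucket for an unsalted key prefix. */
    static List<String[]> ranges(String prefix, int buckets) {
        // Exclusive stop: the prefix with its last character incremented.
        // Assumes a non-empty ASCII prefix whose last character is not the
        // highest possible value.
        char last = prefix.charAt(prefix.length() - 1);
        String stop = prefix.substring(0, prefix.length() - 1) + (char) (last + 1);

        List<String[]> out = new ArrayList<>();
        for (int b = 0; b < buckets; b++) {
            String salt = String.format("%03x", b);
            out.add(new String[] { salt + prefix, salt + stop });
        }
        return out;
    }

    public static void main(String[] args) {
        // All rows of hypothetical model M100, spread over 4 buckets.
        for (String[] r : ranges("M100|", 4)) {
            System.out.println(r[0] + " .. " + r[1]);
        }
    }
}
```

With 4 buckets this yields four ranges, from 000M100| .. 000M100} up to 003M100| .. 003M100}, and the per-model row count is the sum over all of them.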

There are a lot of words and a lot of questions here, and by now I feel I have started shooting in the dark. Try to understand your production data, and if you have more to share, it will certainly help!

Regards,
Cristofer

-----Original Message-----
From: AnandaVelMurugan Chandra Mohan [mailto:ananthu2050@gmail.com]
Sent: Monday, July 16, 2012 02:30
To: user@hbase.apache.org
Subject: Rowkey hashing to avoid hotspotting

Hi,

I am using Hbase to store data about mechanical components. Each component has model no. and serial no. and some other attributes.

I would be querying my data mostly by model no. and serial no. So I created a composite key with these two attributes and added timestamp to make it unique.

To filter the data, I use rowkey filter with regex string comparator and it works well with sample seed data. Now I am afraid whether this set up will lead to region server hotspotting when we load production data in HBase. I read hashing may solve this problem. Can some one help me in implementing hashing the row key? Also I would want the row filter to work as I have to display the number of components in a web page and I use row key filter for implementing that functionality? Any guidance would be of great help.

--
Regards,
Anand