Posted to user@hbase.apache.org by Padmanaban <pa...@gmail.com> on 2012/07/26 08:34:02 UTC

Hbase Data Model to purge old data.

We have the following use case:

- Store telecom CDR data on a per-subscriber basis.
- The data is time series based, and every record is per-subscriber.
- Data comes in around the clock.
- The expected volume of data is around 300 million records/day.
- The data is to be queried 24/7 by an online system, where the filters are
subscriber id and date range.

Since the volume of data is huge, we have data retention policies to archive old
data on a daily basis.
For example, if retention is set to 90 days, every day an offline process would
delete data older than 90 days from HBase and archive it on tape.

The current HBase data model design is as follows:
A separate table for every day's data, with the subscriber id as the row key.
The reason for this is that bulk-deleting one day's data from within a big
table is more expensive than dropping a one-day table.
In this per-day-separate-table model, the load balancer never gets triggered,
as the current day's table is always in memory, and daughter regions
continuously get assigned to the same region server. This leads to region
server hotspots.

Please give feedback on whether the per-day-separate-table model is the best
practice for this use case, considering the data lifecycle management
requirement. If yes, how do we solve the side effect of region server
hotspots? If no, please advise an alternate model.

Thanks in advance,
Padmanaban M



Re: Hbase Data Model to purge old data.

Posted by Alex Baranau <al...@gmail.com>.
Very nice presentation. Awesome simulation tool!

Couldn't help but leave a comment. Or two.

1. It is even possible to set the qualifier name to an empty byte[]. This
might help save you some extra byte(s) ;)
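To put a rough number on that saving, here is a back-of-the-envelope sketch. HBase repeats the qualifier in every KeyValue, so the saving scales with cell count; the 4-byte qualifier length and one-cell-per-record assumption are illustrative, and real on-disk savings depend on the schema and compression:

```python
# Rough estimate of bytes saved per day by switching to an empty qualifier.
# Assumes one cell per record and a hypothetical 4-byte qualifier (e.g. b"data").
records_per_day = 300_000_000  # from the use case above
qualifier_bytes = 4            # illustrative qualifier length
daily_savings_mb = records_per_day * qualifier_bytes / (1024 * 1024)
print(f"~{daily_savings_mb:.0f} MB/day before compression")  # ~1144 MB/day
```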

2. It looks like after several days you have a lot of data in the memstore
which is not frequently accessed, i.e. the memstores of the regions that
hold data several days old or more. It would be great to use this valuable
main memory for storing frequently accessed data instead. Quick thoughts:
* perform a manual flush of the older regions' memstores periodically; this
will free that memory, which you can then use:
  ** for a bigger memstore (I believe that should especially improve your
timings for fetching data older than an hour (there's a spike on the
fetch-time chart there))
  ** for bigger block caches
  ** for having more "hot" regions per RS
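The periodic flush above can be triggered from the HBase shell (and scripted via cron). The table name below is a hypothetical per-day table from the model described earlier; a region name can also be passed to flush selectively:

```
hbase> flush 'cdr_20120723'
```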

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
Solr

P.S. Any chance of converting the first video of the simulation tool to a
GIF or similar and allowing its use for teaching? ;)

P.S.-2 Have you tried connecting it to a real cluster already? I know we
are all busy, but I still hope you'll find the time. Btw, I believe it will
soon be easier to integrate, as HBase metrics are getting a lot of
attention. They should be much more usable soon.

On Thu, Jul 26, 2012 at 1:06 PM, Cristofer Weber <
cristofer.weber@neogrid.com> wrote:

> Hi there
>
> There are some really good ideas in this presentation from HBaseCon:
> http://www.cloudera.com/resource/video-hbasecon-2012-real-performance-gains-with-real-time-data/
>
> Regards,
> Cristofer

Re: Hbase Data Model to purge old data.

Posted by Cristofer Weber <cr...@neogrid.com>.
Hi there

There are some really good ideas in this presentation from HBaseCon: http://www.cloudera.com/resource/video-hbasecon-2012-real-performance-gains-with-real-time-data/

Regards,
Cristofer


Re: Hbase Data Model to purge old data.

Posted by Alex Baranau <al...@gmail.com>.
> reason for this is bulk delete of one days data within a big table is
> more expensive than dropping a one day table

Sorry for the obvious question, but have you tried using TTLs instead of
deleting rows explicitly? This should put less load on the cluster,
though you'll still have to run a major compaction, which might be a
resource-intensive process.
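A TTL is set on the column family, either at table creation or via alter in the HBase shell. The table and column family names below are hypothetical, and 7776000 is 90 days expressed in seconds:

```
hbase> create 'cdr', {NAME => 'd', TTL => 7776000}
# Expired cells are physically removed during compaction:
hbase> major_compact 'cdr'
```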

> In this per-day-separate-table model, the load balancer will never get
> triggered as the current days table is always in memory, and daughter
> regions will continuously get assigned to same region server. This leads
> to a region server hotspots.

Again, maybe an obvious question: have you tried (or is it possible in your
case) to pre-split the table so that regions are distributed over the
cluster from the start?
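One way to pre-split is to generate evenly spaced split keys over the subscriber-id keyspace and pass them to create in the HBase shell via its SPLITS option. A minimal sketch, assuming numeric, zero-padded subscriber ids up to 10 digits (both assumptions are illustrative; adapt to your actual key format):

```python
def split_points(num_regions, key_space=10**10):
    """Evenly spaced split keys over a zero-padded numeric id space."""
    step = key_space // num_regions
    width = len(str(key_space - 1))  # pad so keys sort lexicographically
    return [str(i * step).zfill(width) for i in range(1, num_regions)]

points = split_points(8)
print(points)  # 7 split keys => 8 regions from day one
```

The resulting keys could then be used as, e.g., create 'cdr_20120726', 'd', {SPLITS => ['1250000000', '2500000000', ...]} so that writes spread across region servers immediately instead of hammering a single initial region.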

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
Solr
