Posted to user@hbase.apache.org by Ophir Cohen <op...@gmail.com> on 2011/05/09 11:59:25 UTC

Data retention in HBase

Hi All,
At my company we are currently working hard on deploying our cluster with
HBase.

We are talking about ~20 nodes holding pretty big data (~1TB per day).

As there is a lot of data, we need a retention method, i.e. a way to remove
old data.

The problem is that I can't/don't want to do it using TTL, for two reasons:

   1. Different retention policy for different customers.
   2. Policy might be changed.


Of course, I can do it using a nightly (weekly?) MR job that runs over all
the data and removes the old data.
There are a few problems:

   1. Running over a huge amount of data only to remove a small portion of it.
   2. It'll be a heavy MR job.
   3. Need to perform a major compaction afterwards - that will affect
   performance or even stop service (is that right???).

I might use BulkFileOutputFormat for that job - but it still has those
problems.
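
For reference, here is a minimal sketch of what such a delete job could look
like (hedged: the 'sessions' table name, the key layout and the isExpired()
check are made-up placeholders; TableOutputFormat accepts Delete mutations as
well as Puts):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.mapreduce.Job;

    public class RetentionDeleteJob {

      // Map-only pass over the table; emits a Delete for every expired row.
      static class ExpireMapper
          extends TableMapper<ImmutableBytesWritable, Delete> {
        @Override
        protected void map(ImmutableBytesWritable key, Result row, Context ctx)
            throws IOException, InterruptedException {
          byte[] rowKey = row.getRow();
          if (isExpired(rowKey)) {
            ctx.write(key, new Delete(rowKey));
          }
        }

        // Placeholder: decode the customer id and timestamp from the row key
        // and compare against that customer's retention policy.
        private boolean isExpired(byte[] rowKey) {
          return false;
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "retention-delete");
        job.setJarByClass(RetentionDeleteJob.class);
        Scan scan = new Scan(); // could be narrowed to one customer/time range
        TableMapReduceUtil.initTableMapperJob("sessions", scan,
            ExpireMapper.class, ImmutableBytesWritable.class, Delete.class, job);
        // Null reducer: the mapper's Deletes flow straight to TableOutputFormat.
        TableMapReduceUtil.initTableReducerJob("sessions", null, job);
        job.setNumReduceTasks(0);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }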

As my data is sorted by the retention keys (customer and time) I thought
of this option:

   1. Split the regions to create a region holding the 'candidates for removal'.
   2. Drop this region.


   - Is it possible to drop a region?
   - Do you think it's a good idea?
   - Any other ideas?

Thanks,

Ophir Cohen
LivePerson

Re: Data retention in HBase

Posted by Ophir Cohen <op...@gmail.com>.
PS
The deletion is a matter of privacy, security and terms-of-service, not only
a storage problem...

On Mon, May 9, 2011 at 8:33 PM, Ophir Cohen <op...@gmail.com> wrote:

> Tell that to my company ;)
>
> It looks like a nice tool to have, such a region dropper...
> I'll take a look and will come back to discuss it.
> If I go in this direction I'm surely going to automate it...
>
> Ophir
>
>
> On Mon, May 9, 2011 at 8:29 PM, Stack <st...@duboce.net> wrote:
>
>> On Mon, May 9, 2011 at 10:09 AM, Ophir Cohen <op...@gmail.com> wrote:
>> > Actually the main motivation to remove old rows is that we have storage
>> > limitations (and too much data...).
>> >
>>
>> Ophir: Haven't you heard?  'Real' bigdata men and women don't delete!
>>
>> I think you should try the sequence outlined in the previous mail.
>> It's the least intrusive means of evacuating a bunch of data all in
>> one go.  Can you automate it?
>>
>> St.Ack
>>
>
>

Re: Data retention in HBase

Posted by Ophir Cohen <op...@gmail.com>.
Tell that to my company ;)

It looks like a nice tool to have, such a region dropper...
I'll take a look and will come back to discuss it.
If I go in this direction I'm surely going to automate it...

Ophir

On Mon, May 9, 2011 at 8:29 PM, Stack <st...@duboce.net> wrote:

> On Mon, May 9, 2011 at 10:09 AM, Ophir Cohen <op...@gmail.com> wrote:
> > Actually the main motivation to remove old rows is that we have storage
> > limitations (and too much data...).
> >
>
> Ophir: Haven't you heard?  'Real' bigdata men and women don't delete!
>
> I think you should try the sequence outlined in the previous mail.
> It's the least intrusive means of evacuating a bunch of data all in
> one go.  Can you automate it?
>
> St.Ack
>

Re: Data retention in HBase

Posted by Stack <st...@duboce.net>.
On Mon, May 9, 2011 at 10:09 AM, Ophir Cohen <op...@gmail.com> wrote:
> Actually the main motivation to remove old rows is that we have storage
> limitations (and too much data...).
>

Ophir: Haven't you heard?  'Real' bigdata men and women don't delete!

I think you should try the sequence outlined in the previous mail.
It's the least intrusive means of evacuating a bunch of data all in
one go.  Can you automate it?

St.Ack

Re: Data retention in HBase

Posted by Ophir Cohen <op...@gmail.com>.
Thanks, good luck with the release...
Ophir

On Thu, May 12, 2011 at 8:05 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:

> > So, now with that and with the security/co-processors I can ask: when do
> > you think 0.92 is going to be deployed?
>
> When it's ready, there's no formal plan. We were targeting May 1st for
> our first release candidate but there's still a lot of work to do. I
> can count 12 blockers and a bunch of criticals.
>
> > BTW
> > Do you have any simulator to run an HBase master and region server to check
> > this code?
>
> See how the HBase unit tests use MiniHBaseCluster.
>
> J-D
>

Re: Data retention in HBase

Posted by Jean-Daniel Cryans <jd...@apache.org>.
> So, now with that and with the security/co-processors I can ask: when do you
> think 0.92 is going to be deployed?

When it's ready, there's no formal plan. We were targeting May 1st for
our first release candidate but there's still a lot of work to do. I
can count 12 blockers and a bunch of criticals.

> BTW
> Do you have any simulator to run an HBase master and region server to check
> this code?

See how the HBase unit tests use MiniHBaseCluster.

J-D
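
For reference, a minimal sketch of that pattern (hedged: the table and family
names are made up; HBaseTestingUtility ships with the HBase test jar and wraps
MiniHBaseCluster):

    import org.apache.hadoop.hbase.HBaseTestingUtility;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class MiniClusterSmokeTest {
      public static void main(String[] args) throws Exception {
        HBaseTestingUtility util = new HBaseTestingUtility();
        // Spins up an in-process ZooKeeper, master and region server.
        util.startMiniCluster();
        try {
          HTable table =
              util.createTable(Bytes.toBytes("t1"), Bytes.toBytes("f1"));
          // ... exercise splits / region deletion against 'table' here ...
        } finally {
          util.shutdownMiniCluster();
        }
      }
    }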

Re: Data retention in HBase

Posted by Ophir Cohen <op...@gmail.com>.
Some more results:

   1. Splitting regions:
      1. In the Cloudera distribution (i.e. HBase 0.90.1) there is a bug:
      before compaction it checks for splitting but never checks
      HRegion.splitRequest... In trunk it's already fixed.
      2. Trunk also adds a way to split at a specific row
      (HBASE-3328 <https://issues.apache.org/jira/browse/HBASE-3328>
      and HBASE-3437 <https://issues.apache.org/jira/browse/HBASE-3437>).
      3. Still missing is an update of the JSP page:
      HBASE-3462 <https://issues.apache.org/jira/browse/HBASE-3462>
   2. I've filed an issue for splitting regions at a requested position
   rather than at the middle of the file:
   HBASE-3879 <https://issues.apache.org/jira/browse/HBASE-3879>
   For me it'll be a great update...

So to summarize all my findings:
Retention using region deletion can be a good solution, assuming:

   1. Your data is sorted by the retention key.
   2. You have HBase 0.92 or higher.

So, now with that and with the security/co-processors I can ask: when do you
think 0.92 is going to be deployed?

BTW
Do you have any simulator to run an HBase master and region server to check
this code?
Ophir


On Wed, May 11, 2011 at 10:32 PM, Ophir Cohen <op...@gmail.com> wrote:

> Thanks for the comments,
>
> Going to work on it tomorrow - I'll keep you updated.
> Ophir
>
> On Wed, May 11, 2011 at 8:01 PM, Stack <st...@duboce.net> wrote:
>
>> On Wed, May 11, 2011 at 6:14 AM, Ophir Cohen <op...@gmail.com> wrote:
>> > My results from today's research:
>> >
>> > I tried to delete a region as Stack suggested:
>> >
>> >   1. *close_region*
>> >   2. Remove files from file system.
>> >   3. *assign* the region again.
>> >
>>
>> Try inserting something into that region and then getting it back out.
>>  Flush it explicitly.  See that a file is added to HDFS.  Again get
>> the result back out.  That'll tell you for sure if it works.
>>
>
> Tried that already - and got the results back. Going to try it tomorrow
> with a bigger data size.
>
>>
>> >   1. Can I split a region by a specific key? It looks like it splits
>> >   automatically.
>>
>> You can pass a key in the UI and in the shell.  If the key exists, I
>> believe it will split on the passed key (You should confirm).  If the
>> key does not exist, it'll split on the closest.
>>
>
> The web page just states that the split will be on the region that this key
> exists in. I'll try to trace it in the code, as it seems not to work right now.
>
>>
>>
>> >   2. It seems that splitting from the command line does not work... I get
>> >   the message in the log but nothing really happens. Actually the code
>> >   states that it triggered a compaction and that should be enough (????).
>>
>>
>> This sounds like a bug.  The UI uses the same code path so the bug is
>> probably in it too.  We might have to do some fixing here.  Want to try
>> tracing where it goes awry?
>>
>
> I'll trace it and let you know. I'll file a bug if needed... And yes, it
> works neither from the page nor from the shell.
>
>>
>>
>> >   3. Is there a way to choose my method of region splitting? I think it
>> >   can be a great option - a way to state when and how a region is
>> >   split...
>> >
>>
>> No.  It's the size of the biggest store file that determines when we
>> split.  It's not currently pluggable.  But it's a good idea (File an
>> issue?).  I'm not sure if coprocessors have influence over when a
>> split runs.
>>
> OK. I'll see - it looks like a nice feature. For me it'll be exactly what
> I need - I'll split by customer.
>
>
>> FYI, split check happens after compaction check.  That might be why
>> you see the compaction message above even though you invoked a
>> split.
>>
>
> Yep, that explains it. The comments in the code also state that compaction
> is enough to make the split happen (but then the split doesn't happen :()
>
>>
>> St.Ack
>>
>
>

Re: Data retention in HBase

Posted by Ophir Cohen <op...@gmail.com>.
Thanks for the comments,

Going to work on it tomorrow - I'll keep you updated.
Ophir

On Wed, May 11, 2011 at 8:01 PM, Stack <st...@duboce.net> wrote:

> On Wed, May 11, 2011 at 6:14 AM, Ophir Cohen <op...@gmail.com> wrote:
> > My results from today's research:
> >
> > I tried to delete a region as Stack suggested:
> >
> >   1. *close_region*
> >   2. Remove files from file system.
> >   3. *assign* the region again.
> >
>
> Try inserting something into that region and then getting it back out.
>  Flush it explicitly.  See that a file is added to HDFS.  Again get
> the result back out.  That'll tell you for sure if it works.
>

Tried that already - and got the results back. Going to try it tomorrow with
a bigger data size.

>
> >   1. Can I split a region by a specific key? It looks like it splits
> >   automatically.
>
> You can pass a key in the UI and in the shell.  If the key exists, I
> believe it will split on the passed key (You should confirm).  If the
> key does not exist, it'll split on the closest.
>

The web page just states that the split will be on the region that this key
exists in. I'll try to trace it in the code, as it seems not to work right now.

>
>
> >   2. It seems that splitting from the command line does not work... I get
> >   the message in the log but nothing really happens. Actually the code
> >   states that it triggered a compaction and that should be enough (????).
>
>
> This sounds like a bug.  The UI uses the same code path so the bug is
> probably in it too.  We might have to do some fixing here.  Want to try
> tracing where it goes awry?
>

I'll trace it and let you know. I'll file a bug if needed... And yes, it
works neither from the page nor from the shell.

>
>
> >   3. Is there a way to choose my method of region splitting? I think it
> >   can be a great option - a way to state when and how a region is
> >   split...
> >
>
> No.  It's the size of the biggest store file that determines when we
> split.  It's not currently pluggable.  But it's a good idea (File an
> issue?).  I'm not sure if coprocessors have influence over when a
> split runs.
>
OK. I'll see - it looks like a nice feature. For me it'll be exactly what I
need - I'll split by customer.


> FYI, split check happens after compaction check.  That might be why
> you see the compaction message above even though you invoked a
> split.
>

Yep, that explains it. The comments in the code also state that compaction
is enough to make the split happen (but then the split doesn't happen :()

>
> St.Ack
>

Re: Data retention in HBase

Posted by Stack <st...@duboce.net>.
On Wed, May 11, 2011 at 6:14 AM, Ophir Cohen <op...@gmail.com> wrote:
> My results from today's research:
>
> I tried to delete a region as Stack suggested:
>
>   1. *close_region*
>   2. Remove files from file system.
>   3. *assign* the region again.
>

Try inserting something into that region and then getting it back out.
Flush it explicitly.  See that a file is added to HDFS.  Again get
the result back out.  That'll tell you for sure if it works.
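
A minimal shell sketch of that check (hedged: the table, row and column names
are made up; pick a row key that falls inside the emptied region):

    hbase> put 't1', 'row-in-region', 'f1:q', 'some value'
    hbase> flush 't1'                  # force the memstore out to a new store file
    hbase> get 't1', 'row-in-region'   # should come back with the value

    $ hadoop fs -ls /hbase/t1          # the new store file appears under the region dir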

>   1. Can I split a region by a specific key? It looks like it splits
>   automatically.

You can pass a key in the UI and in the shell.  If the key exists, I
believe it will split on the passed key (You should confirm).  If the
key does not exist, it'll split on the closest.
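
A minimal shell sketch of both paths (hedged: the table name and split key are
made up, and passing an explicit split point needs a build that carries the
split-at-row work mentioned elsewhere in this thread, HBASE-3328/HBASE-3437,
i.e. trunk/0.92 at the time):

    hbase> split 'sessions'                 # split at the midpoint of the biggest store file
    hbase> split 'sessions', 'customer42'   # split at (or near) an explicit row key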


>   2. It seems that splitting from the command line does not work... I get
>   the message in the log but nothing really happens. Actually the code
>   states that it triggered a compaction and that should be enough (????).


This sounds like a bug.  The UI uses the same code path so the bug is
probably in it too.  We might have to do some fixing here.  Want to try
tracing where it goes awry?


>   3. Is there a way to choose my method of region splitting? I think it
>   can be a great option - a way to state when and how a region is split...
>

No.  It's the size of the biggest store file that determines when we
split.  It's not currently pluggable.  But it's a good idea (File an
issue?).  I'm not sure if coprocessors have influence over when a
split runs.

FYI, split check happens after compaction check.  That might be why
you see the compaction message above even though you invoked a
split.

St.Ack

Re: Data retention in HBase

Posted by Ophir Cohen <op...@gmail.com>.
My results from today's research:

I tried to delete a region as Stack suggested:

   1. *close_region*
   2. Remove files from file system.
   3. *assign* the region again.

It looks like it works!
The region still exists but it's empty.
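
A minimal sketch of those three steps (hedged: 'sessions', the region name and
the 'f1' family are placeholders; only the store files should be removed - the
region's .regioninfo file and its .META. row stay in place):

    hbase> close_region 'REGIONNAME'    # optionally pass the hosting server name too

    $ hadoop fs -rmr /hbase/sessions/REGION_ENCODED_NAME/f1   # drop the stores

    hbase> assign 'REGIONNAME'          # reopen the now-empty region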

Looks good, but definitely not the end of the road.
In order to finalize this solution I still have these questions:


   1. Can I split a region by a specific key? It looks like it splits
   automatically.
   2. It seems that splitting from the command line does not work... I get
   the message in the log but nothing really happens. Actually the code
   states that it triggered a compaction and that should be enough (????).
   3. Is there a way to choose my method of region splitting? I think it
   can be a great option - a way to state when and how a region is split...

Any thoughts?
Thanks,
Ophir


On Tue, May 10, 2011 at 6:50 PM, Ophir Cohen <op...@gmail.com> wrote:

> OK, so to summarize the discussion (and raise some more problems) here is
> what I gathered:
>
> I have two options:
>
> 1. I can use a map/reduce job on the rows I want to delete.
> Main problem here: after each job I need to run a major compaction, which
> will stop service at compaction time.
>
> Question:
>
>    - Why should a major compaction stop service (BTW I'm mainly concerned
>    about insertions; a read denial of service I can live with)?
>
> 2. Split off a specific region and delete that region.
>
> Questions here:
>
>    - How does the META table get updated after I close the region and
>    remove the files? Should I remove it from the META table as well?
>    - Why do I need to disable the table? For how long, do you think, do I
>    need to disable it? Can I bypass it?
>
> I'm going to execute some tests tomorrow on that subject so any comments
> will be helpful.
> I'll keep you updated with the results.
>
> Thanks again,
> Ophir
>
>
> On Mon, May 9, 2011 at 8:34 PM, Ted Dunning <td...@maprtech.com> wrote:
>
>> If you change your key to "date - customer id - time stamp - session id"
>> then you shouldn't lose any important
>> data locality, but you would be able to delete things more efficiently.
>>
>> For one thing, any map-reduce programs that are running for deleting would
>> be doing dense scans over a small
>> part of your data. That might make them run much faster.
>>
>> For another, you should be able to do the region switch trick and then
>> drop
>> entire regions. That has the unfortunate
>> side-effect of requiring that you disable the table for a short period (I
>> think).
>>
>> On Mon, May 9, 2011 at 10:09 AM, Ophir Cohen <op...@gmail.com> wrote:
>>
>> > Thanks for the answer!
>> >
>> > A little bit more info:
>> > Our data is internal events grouped into sessions (i.e. groups of events).
>> > There are different sessions for different customers.
>> > We're talking about millions of sessions per day.
>> >
>> > The key is *customer id - time stamp - session id*.
>> > So yes, it's sorted by customer and date, and as I want to remove rows by
>> > customer and date - it's sorted all right.
>> > Actually the main motivation to remove old rows is that we have storage
>> > limitations (and too much data...).
>> >
>> > So, my concern is whether we can do something better than a nightly/weekly
>> > map reduce job that ends up with a major compaction.
>> > Ophir
>> > PS
>> > The majority of my customers share the same retention policy but I still
>> > need the ability to change it for a specific customer.
>> >
>> >
>> > On Mon, May 9, 2011 at 6:48 PM, Ted Dunning <td...@maprtech.com> wrote:
>> >
>> > > Can you say a bit more about your data organization?
>> > >
>> > > Are you storing transactions of some kind? If so, can your key involve
>> > > time? I think that putting some extract of time (day number perhaps) as
>> > > a leading element of the key would help.
>> > >
>> > > Are you storing profiles where the key is the user (or something) id
>> > > and the
>> > > data is essentially a list of transactions? If so, can you segregate
>> > > transactions into separate column families that can be dropped as data
>> > > expires?
>> > >
>> > > When you say data expiration varies by customer, is that really
>> > > necessary or can you have a lowest common denominator for actual
>> > > deletions with rules that govern how much data is actually visible to
>> > > the consumer of the data?
>> > >
>> > > On Mon, May 9, 2011 at 2:59 AM, Ophir Cohen <op...@gmail.com> wrote:
>> > >
>> > > > Hi All,
>> > > > At my company we are currently working hard on deploying our cluster
>> > > > with HBase.
>> > > >
>> > > > We are talking about ~20 nodes holding pretty big data (~1TB per day).
>> > > >
>> > > > As there is a lot of data, we need a retention method, i.e. a way to
>> > > > remove old data.
>> > > >
>> > > > The problem is that I can't/don't want to do it using TTL, for two
>> > > > reasons:
>> > > >
>> > > > 1. Different retention policy for different customers.
>> > > > 2. Policy might be changed.
>> > > >
>> > > >
>> > > > Of course, I can do it using a nightly (weekly?) MR job that runs over
>> > > > all the data and removes the old data.
>> > > > There are a few problems:
>> > > >
>> > > > 1. Running over a huge amount of data only to remove a small portion
>> > > > of it.
>> > > > 2. It'll be a heavy MR job.
>> > > > 3. Need to perform a major compaction afterwards - that will affect
>> > > > performance or even stop service (is that right???).
>> > > >
>> > > > I might use BulkFileOutputFormat for that job - but it still has those
>> > > > problems.
>> > > >
>> > > > As my data is sorted by the retention keys (customer and time) I
>> > > > thought of this option:
>> > > >
>> > > > 1. Split the regions to create a region holding the 'candidates for
>> > > > removal'.
>> > > > 2. Drop this region.
>> > > >
>> > > >
>> > > > - Is it possible to drop a region?
>> > > > - Do you think it's a good idea?
>> > > > - Any other ideas?
>> > > >
>> > > > Thanks,
>> > > >
>> > > > Ophir Cohen
>> > > > LivePerson
>> > > >
>> > >
>> >
>>
>
>

Re: Data retention in HBase

Posted by Ophir Cohen <op...@gmail.com>.
OK, so to summarize the discussion (and raise some more problems) here is
what I gathered:

I have two options:

1. I can use a map/reduce job on the rows I want to delete.
Main problem here: after each job I need to run a major compaction, which
will stop service at compaction time.

Question:

   - Why should a major compaction stop service (BTW I'm mainly concerned
   about insertions; a read denial of service I can live with)?

2. Split off a specific region and delete that region.

Questions here:

   - How does the META table get updated after I close the region and remove
   the files? Should I remove it from the META table as well?
   - Why do I need to disable the table? For how long, do you think, do I
   need to disable it? Can I bypass it?

I'm going to execute some tests tomorrow on that subject so any comments
will be helpful.
I'll keep you updated with the results.

Thanks again,
Ophir


On Mon, May 9, 2011 at 8:34 PM, Ted Dunning <td...@maprtech.com> wrote:

> If you change your key to "date - customer id - time stamp - session id"
> then you shouldn't lose any important
> data locality, but you would be able to delete things more efficiently.
>
> For one thing, any map-reduce programs that are running for deleting would
> be doing dense scans over a small
> part of your data. That might make them run much faster.
>
> For another, you should be able to do the region switch trick and then drop
> entire regions. That has the unfortunate
> side-effect of requiring that you disable the table for a short period (I
> think).
>
> On Mon, May 9, 2011 at 10:09 AM, Ophir Cohen <op...@gmail.com> wrote:
>
> > Thanks for the answer!
> >
> > A little bit more info:
> > Our data is internal events grouped into sessions (i.e. groups of events).
> > There are different sessions for different customers.
> > We're talking about millions of sessions per day.
> >
> > The key is *customer id - time stamp - session id*.
> > So yes, it's sorted by customer and date, and as I want to remove rows by
> > customer and date - it's sorted all right.
> > Actually the main motivation to remove old rows is that we have storage
> > limitations (and too much data...).
> >
> > So, my concern is whether we can do something better than a nightly/weekly
> > map reduce job that ends up with a major compaction.
> > Ophir
> > PS
> > The majority of my customers share the same retention policy but I still
> > need the ability to change it for a specific customer.
> >
> >
> > On Mon, May 9, 2011 at 6:48 PM, Ted Dunning <td...@maprtech.com> wrote:
> >
> > > Can you say a bit more about your data organization?
> > >
> > > Are you storing transactions of some kind? If so, can your key involve
> > > time? I think that putting some extract of time (day number perhaps) as
> > > a leading element of the key would help.
> > >
> > > Are you storing profiles where the key is the user (or something) id
> > > and the
> > > data is essentially a list of transactions? If so, can you segregate
> > > transactions into separate column families that can be dropped as data
> > > expires?
> > >
> > > When you say data expiration varies by customer, is that really
> > > necessary or can you have a lowest common denominator for actual
> > > deletions with rules that govern how much data is actually visible to
> > > the consumer of the data?
> > >
> > > On Mon, May 9, 2011 at 2:59 AM, Ophir Cohen <op...@gmail.com> wrote:
> > >
> > > > Hi All,
> > > > At my company we are currently working hard on deploying our cluster
> > > > with HBase.
> > > >
> > > > We are talking about ~20 nodes holding pretty big data (~1TB per day).
> > > >
> > > > As there is a lot of data, we need a retention method, i.e. a way to
> > > > remove old data.
> > > >
> > > > The problem is that I can't/don't want to do it using TTL, for two
> > > > reasons:
> > > >
> > > > 1. Different retention policy for different customers.
> > > > 2. Policy might be changed.
> > > >
> > > >
> > > > Of course, I can do it using a nightly (weekly?) MR job that runs over
> > > > all the data and removes the old data.
> > > > There are a few problems:
> > > >
> > > > 1. Running over a huge amount of data only to remove a small portion
> > > > of it.
> > > > 2. It'll be a heavy MR job.
> > > > 3. Need to perform a major compaction afterwards - that will affect
> > > > performance or even stop service (is that right???).
> > > >
> > > > I might use BulkFileOutputFormat for that job - but it still has those
> > > > problems.
> > > >
> > > > As my data is sorted by the retention keys (customer and time) I
> > > > thought of this option:
> > > >
> > > > 1. Split the regions to create a region holding the 'candidates for
> > > > removal'.
> > > > 2. Drop this region.
> > > >
> > > >
> > > > - Is it possible to drop a region?
> > > > - Do you think it's a good idea?
> > > > - Any other ideas?
> > > >
> > > > Thanks,
> > > >
> > > > Ophir Cohen
> > > > LivePerson
> > > >
> > >
> >
>

Re: Data retention in HBase

Posted by Ted Dunning <td...@maprtech.com>.
If you change your key to "date - customer id - time stamp - session id"
then you shouldn't lose any important
data locality, but you would be able to delete things more efficiently.

For one thing, any map-reduce programs that are running for deleting would
be doing dense scans over a small
part of your data.  That might make them run much faster.

For another, you should be able to do the region switch trick and then drop
entire regions.  That has the unfortunate
side-effect of requiring that you disable the table for a short period (I
think).
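
A minimal sketch of such a composite key (hedged: the fixed-width day bucket
and the '|' separators are illustrative choices, not anything HBase
prescribes; Bytes is org.apache.hadoop.hbase.util.Bytes):

    // Key layout: date | customer id | timestamp | session id.
    // The leading day bucket keeps expired rows contiguous, so whole date
    // ranges can be split off into their own regions and dropped together.
    byte[] rowKey = Bytes.add(
        Bytes.toBytes("20110509|"),                // day bucket, fixed width
        Bytes.toBytes("customer42|"),              // customer id
        Bytes.add(Bytes.toBytes(1304931565123L),   // event timestamp
                  Bytes.toBytes("session-0001"))); // session id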

On Mon, May 9, 2011 at 10:09 AM, Ophir Cohen <op...@gmail.com> wrote:

> Thanks for the answer!
>
> A little bit more info:
> Our data is internal events grouped into sessions (i.e. groups of events).
> There are different sessions for different customers.
> We're talking about millions of sessions per day.
>
> The key is *customer id - time stamp - session id*.
> So yes, it's sorted by customer and date, and as I want to remove rows by
> customer and date - it's sorted all right.
> Actually the main motivation to remove old rows is that we have storage
> limitations (and too much data...).
>
> So, my concern is whether we can do something better than a nightly/weekly
> map reduce job that ends up with a major compaction.
> Ophir
> PS
> The majority of my customers share the same retention policy but I still
> need the ability to change it for a specific customer.
>
>
> On Mon, May 9, 2011 at 6:48 PM, Ted Dunning <td...@maprtech.com> wrote:
>
> > Can you say a bit more about your data organization?
> >
> > Are you storing transactions of some kind?   If so, can your key involve
> > time?  I think that putting some extract of time (day number perhaps) as
> > a leading element of the key would help.
> >
> > Are you storing profiles where the key is the user (or something) id and
> > the
> > data is essentially a list of transactions?  If so, can you segregate
> > transactions into separate column families that can be dropped as data
> > expires?
> >
> > When you say data expiration varies by customer, is that really necessary
> > or
> > can you have a lowest common denominator for actual deletions with rules
> > that govern how much data is actually visible to the consumer of the
> > data?
> >
> > On Mon, May 9, 2011 at 2:59 AM, Ophir Cohen <op...@gmail.com> wrote:
> >
> > > Hi All,
> > > At my company we are currently working hard on deploying our cluster
> > > with HBase.
> > >
> > > We are talking about ~20 nodes holding pretty big data (~1TB per day).
> > >
> > > As there is a lot of data, we need a retention method, i.e. a way to
> > > remove old data.
> > >
> > > The problem is that I can't/don't want to do it using TTL, for two
> > > reasons:
> > >
> > >   1. Different retention policy for different customers.
> > >   2. Policy might be changed.
> > >
> > >
> > > Of course, I can do it using a nightly (weekly?) MR job that runs over
> > > all the data and removes the old data.
> > > There are a few problems:
> > >
> > >   1. Running over a huge amount of data only to remove a small portion
> > >   of it.
> > >   2. It'll be a heavy MR job.
> > >   3. Need to perform a major compaction afterwards - that will affect
> > >   performance or even stop service (is that right???).
> > >
> > > I might use BulkFileOutputFormat for that job - but it still has those
> > > problems.
> > >
> > > As my data is sorted by the retention keys (customer and time) I
> > > thought of this option:
> > >
> > >   1. Split the regions to create a region holding the 'candidates for
> > >   removal'.
> > >   2. Drop this region.
> > >
> > >
> > >   - Is it possible to drop a region?
> > >   - Do you think it's a good idea?
> > >   - Any other ideas?
> > >
> > > Thanks,
> > >
> > > Ophir Cohen
> > > LivePerson
> > >
> >
>

Re: Data retention in HBase

Posted by Ophir Cohen <op...@gmail.com>.
Thanks for the answer!

A little bit more info:
Our data is internal events grouped into sessions (i.e. groups of events).
There are different sessions for different customers.
We're talking about millions of sessions per day.

The key is *customer id - time stamp - session id*.
So yes, it's sorted by customer and date, and as I want to remove rows by
customer and date - it's sorted all right.
Actually the main motivation to remove old rows is that we have storage
limitations (and too much data...).

So, my concern is whether we can do something better than a nightly/weekly
map reduce job that ends up with a major compaction.
Ophir
PS
The majority of my customers share the same retention policy but I still need
the ability to change it for a specific customer.


On Mon, May 9, 2011 at 6:48 PM, Ted Dunning <td...@maprtech.com> wrote:

> Can you say a bit more about your data organization?
>
> Are you storing transactions of some kind?   If so, can your key involve
> time?  I think that putting some extract of time (day number perhaps) as a
> leading element of the key would help.
>
> Are you storing profiles where the key is the user (or something) id and
> the
> data is essentially a list of transactions?  If so, can you segregate
> transactions into separate column families that can be dropped as data
> expires?
>
> When you say data expiration varies by customer, is that really necessary
> or
> can you have a lowest common denominator for actual deletions with rules
> that govern how much data is actually visible to the consumer of the data?
>
> On Mon, May 9, 2011 at 2:59 AM, Ophir Cohen <op...@gmail.com> wrote:
>
> > Hi All,
> > At my company we are currently working hard on deploying our cluster
> > with HBase.
> >
> > We are talking about ~20 nodes holding pretty big data (~1TB per day).
> >
> > As there is a lot of data, we need a retention method, i.e. a way to
> > remove old data.
> >
> > The problem is that I can't/don't want to do it using TTL, for two
> > reasons:
> >
> >   1. Different retention policy for different customers.
> >   2. Policy might be changed.
> >
> >
> > Of course, I can do it using a nightly (weekly?) MR job that runs over
> > all the data and removes the old data.
> > There are a few problems:
> >
> >   1. Running over a huge amount of data only to remove a small portion
> >   of it.
> >   2. It'll be a heavy MR job.
> >   3. Need to perform a major compaction afterwards - that will affect
> >   performance or even stop service (is that right???).
> >
> > I might use BulkFileOutputFormat for that job - but it still has those
> > problems.
> >
> > As my data is sorted by the retention keys (customer and time) I
> > thought of this option:
> >
> >   1. Split the regions to create a region holding the 'candidates for
> >   removal'.
> >   2. Drop this region.
> >
> >
> >   - Is it possible to drop a region?
> >   - Do you think it's a good idea?
> >   - Any other ideas?
> >
> > Thanks,
> >
> > Ophir Cohen
> > LivePerson
> >
>

Re: Data retention in HBase

Posted by Ted Dunning <td...@maprtech.com>.
Can you say a bit more about your data organization?

Are you storing transactions of some kind?   If so, can your key involve time?
I think that putting some extract of time (day number perhaps) as a leading
element of the key would help.

Are you storing profiles where the key is the user (or something) id and the
data is essentially a list of transactions?  If so, can you segregate
transactions into separate column families that can be dropped as data
expires?

When you say data expiration varies by customer, is that really necessary or
can you have a lowest common denominator for actual deletions with rules
that govern how much data is actually visible to the consumer of the data?

On Mon, May 9, 2011 at 2:59 AM, Ophir Cohen <op...@gmail.com> wrote:

> Hi All,
> At my company we are currently working hard on deploying our cluster with
> HBase.
>
> We are talking about ~20 nodes holding pretty big data (~1TB per day).
>
> As there is a lot of data, we need a retention method, i.e. a way to remove
> old data.
>
> The problem is that I can't/don't want to do it using TTL, for two reasons:
>
>   1. Different retention policy for different customers.
>   2. Policy might be changed.
>
>
> Of course, I can do it using a nightly (weekly?) MR job that runs over all
> the data and removes the old data.
> There are a few problems:
>
>   1. Running over a huge amount of data only to remove a small portion of it.
>   2. It'll be a heavy MR job.
>   3. Need to perform a major compaction afterwards - that will affect
>   performance or even stop service (is that right???).
>
> I might use BulkFileOutputFormat for that job - but it still has those
> problems.
>
> As my data is sorted by the retention keys (customer and time) I thought
> of this option:
>
>   1. Split the regions to create a region holding the 'candidates for
>   removal'.
>   2. Drop this region.
>
>
>   - Is it possible to drop a region?
>   - Do you think it's a good idea?
>   - Any other ideas?
>
> Thanks,
>
> Ophir Cohen
> LivePerson
>

Re: Data retention in HBase

Posted by Stack <st...@duboce.net>.
What Ted says and then some comments inline below.

On Mon, May 9, 2011 at 2:59 AM, Ophir Cohen <op...@gmail.com> wrote:
>   3. Need to perform a major compaction afterwards - that will affect
>   performance or even stop service (is that right???).
>

It will do the former.  It should not do the latter.  That's a problem
if it does.


> As my data is sorted by the retention keys (customer and time) I thought
> of this option:
>
>   1. Split the regions to create a region holding the 'candidates for
>   removal'.
>   2. Drop this region.
>
>
>   - Is it possible to drop a region?
>   - Do you think it's a good idea?
>

You could do this.  Downside is that it's a manual process -- is it? --
and the other downside is that there are no tools currently to help you
with it; you'll have to craft them yourself.  Upside is that it
should not be hard.  You make sure that the region to remove is closed
on the hosting regionserver before you do anything.  You would then
remove its stores in the filesystem (i.e. you leave the region in
.META. -- you are removing its content in the fs, but you need to have
the region closed when you do it or else it'll go crazy when its removed
files go missing from under it).  You would then need to open the
region again.  The latter needs testing.  I'm not sure if it works.
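
A minimal sketch of the filesystem side (hedged: 'sessions', the encoded
region name and the 'f1' family are made up; in 0.90 each region directory
under /hbase/<table>/ holds a .regioninfo file plus one directory per column
family):

    $ hadoop fs -ls /hbase/sessions                  # one directory per region
    $ hadoop fs -ls /hbase/sessions/1028785192       # .regioninfo, f1/, ...
    $ hadoop fs -rmr /hbase/sessions/1028785192/f1   # remove the stores only;
                                                     # .regioninfo and the
                                                     # .META. row stay intact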

St.Ack