Posted to user@hbase.apache.org by Alex Baranau <al...@gmail.com> on 2012/07/09 21:36:14 UTC

Can I manually remove HFiles (similar to bulk import, but bulk remove)?

Hello,

I wonder, for purging old data: if I'm OK with a "remove all StoreFiles which
are older than ..." approach, can I do that? To me it seems like this could be
a very effective way to remove old data, similar to the fast bulk import
functionality, but for deletion.

Thank you,

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Solr - Lucene - Hadoop - HBase

Re: Can I manually remove HFiles (similar to bulk import, but bulk remove)?

Posted by Jonathan Hsieh <jo...@cloudera.com>.
On Mon, Jul 9, 2012 at 1:05 PM, Alex Baranau <al...@gmail.com> wrote:

> Hey, this is closer!
>
> However, I think I'd want to avoid major compaction. In fact I was thinking
> about avoiding any compactions & splitting.
> ...

> So, you are saying that major compaction will look at the max/min ts metainfo
> of the HFile and will remove the whole file based on TTL if necessary
> (without going through the file)? Can I tell it not to actually compact the
> other HFiles (i.e. leave them as is, otherwise it would not be as easy to
> remove HFiles again in an hour)? I.e. it looks like "delete only whole HFiles
> based on TTL" functionality is what I need here.
>
Off the top of my head, I don't know how "smart" the major compaction code
is with respect to TTLs. I'm pretty sure it isn't smart enough to explicitly
ignore specific files.


> I fear that the complexity of removing HFiles may be caused by the (block)
> cache, which may still hold their data. Is that right? I'm actually OK with
> HBase returning data from files I "deleted" by removing HFiles: I will
> specify a timerange on scans anyway (in this example, to omit things older
> than 1 week).
>
>
I'm not sure what the block cache eviction policy is when a single region
is closed, but it sounds like you are ok if stale data remains.

Sounds like you might want to try the close/delete/open advanced approach
on a test cluster to see if it meets your needs.

Jon.

-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com

Re: Can I manually remove HFiles (similar to bulk import, but bulk remove)?

Posted by Alex Baranau <al...@gmail.com>.
Thank you guys for the pointers/info! I'll try to make use of it. If it
turns into something re-usable (like a script, etc.) I will open a JIRA issue
and add it for others to use.

Thanx again,
Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Solr - Lucene - Hadoop - HBase

On Wed, Jul 11, 2012 at 8:51 AM, Stack <st...@duboce.net> wrote:

> On Mon, Jul 9, 2012 at 10:05 PM, Alex Baranau <al...@gmail.com>
> wrote:
> > I fear that the complexity of removing HFiles may be caused by the (block)
> > cache, which may still hold their data. Is that right? I'm actually OK with
> > HBase returning data from files I "deleted" by removing HFiles: I will
> > specify a timerange on scans anyway (in this example, to omit things older
> > than 1 week).
> >
>
> I think this is a use case we should support natively.  Someone around
> the corner from us was looking to do this.  They load a complete
> dataset each night and on the weekends they want to just drop the old
> stuff by removing the hfiles older than N days.
>
> You could script it now.  Look at the hfiles in hdfs -- they have
> sufficient metadata IIRC -- and then do the prescription Jon suggests
> above of close, remove, and reopen.  We could add an API to do this;
> i.e. reread hdfs for hfiles (would be nice to do it 'atomically'
> telling the new API which to drop).
>
> You bring up block cache.  That should be fine.  We shouldn't be
> reading blocks for files that are no longer open.  Old blocks should
> get aged out.
>
> On compaction dropping complete hfiles if they are outside TTL, I'm
> not sure we have that (didn't look too closely).
>
> St.Ack
>

Re: Can I manually remove HFiles (similar to bulk import, but bulk remove)?

Posted by Stack <st...@duboce.net>.
On Mon, Jul 9, 2012 at 10:05 PM, Alex Baranau <al...@gmail.com> wrote:
> I fear that the complexity of removing HFiles may be caused by the (block)
> cache, which may still hold their data. Is that right? I'm actually OK with
> HBase returning data from files I "deleted" by removing HFiles: I will
> specify a timerange on scans anyway (in this example, to omit things older
> than 1 week).
>

I think this is a use case we should support natively.  Someone around
the corner from us was looking to do this.  They load a complete
dataset each night and on the weekends they want to just drop the old
stuff by removing the hfiles older than N days.

You could script it now.  Look at the hfiles in hdfs -- they have
sufficient metadata IIRC -- and then do the prescription Jon suggests
above of close, remove, and reopen.  We could add an API to do this;
i.e. reread hdfs for hfiles (would be nice to do it 'atomically'
telling the new API which to drop).
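
As a rough sketch of that scripting approach (table name, column family,
region and file names below are made-up placeholders; the layout assumed is
the 0.92/0.94-style /hbase/<table>/<encoded-region>/<cf>/ directories, so
verify against your cluster before deleting anything):

  # list the store files of one column family with their HDFS modification times
  hadoop fs -ls /hbase/mytable/*/d

  # crude age check: print paths whose HDFS mtime is older than 7 days
  # (GNU date; the lexical compare works because the date column is YYYY-MM-DD)
  hadoop fs -ls /hbase/mytable/*/d | \
    awk -v cutoff="$(date -d '7 days ago' +%Y-%m-%d)" 'NF == 8 && $6 < cutoff {print $8}'

  # or inspect a single hfile's metadata (entry count, timerange if written)
  hbase org.apache.hadoop.hbase.io.hfile.HFile -m -f \
    /hbase/mytable/5a3f9c41d8be1e2f4c6d7a8b9c0d1e2f/d/6201936508421573219

Then, per affected region, do the close/remove/reopen dance Jon describes.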

You bring up block cache.  That should be fine.  We shouldn't be
reading blocks for files that are no longer open.  Old blocks should
get aged out.

On compaction dropping complete hfiles if they are outside TTL, I'm
not sure we have that (didn't look too closely).

St.Ack

Re: Can I manually remove HFiles (similar to bulk import, but bulk remove)?

Posted by Alex Baranau <al...@gmail.com>.
Hey, this is closer!

However, I think I'd want to avoid major compaction. In fact, I was thinking
about avoiding any compactions & splitting.
E.g. say I process some amount of data every hour (e.g. with an MR job); the
output is written as a set of HFiles and added to be served by HBase. At
the same time I only care to keep 1 week of data. In that case, ideally,
I'd like to do the following:
* pre-split the table into N regions, to be evenly distributed over the
cluster
* turn off minor/major compactions (it is OK for me to have 24*7 HFiles per
region, given one CF, and I know they will not exceed the region max size)
* periodically remove HFiles older than one week

By setting up the table like this, I'd avoid unnecessary split operations,
compaction operations, and Region moves (i.e. avoid redundant IO/CPU and,
hopefully, breaking data locality).
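
For instance, such a table could be set up in the hbase shell roughly like
this (table/CF names and split points are made-up placeholders, and exact
shell syntax differs a bit between HBase versions):

  # one CF with a 1-week TTL, pre-split at creation time
  create 'metrics', {NAME => 'd', TTL => 604800}, {SPLITS => ['1','2','3','4']}

  # splits can be held off with a very large hbase.hregion.max.filesize, and
  # time-based major compactions disabled with hbase.hregion.majorcompaction = 0
  # in hbase-site.xml; minor compaction thresholds are hbase-site.xml knobs too

With compactions turned off, the TTL by itself shouldn't physically remove
anything; expired cells are normally filtered out at read time, but the old
files stay on disk until something actually drops them.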

So, you are saying that major compaction will look at the max/min ts metainfo
of the HFile and will remove the whole file based on TTL if necessary
(without going through the file)? Can I tell it not to actually compact the
other HFiles (i.e. leave them as is, otherwise it would not be as easy to
remove HFiles again in an hour)? I.e. it looks like "delete only whole HFiles
based on TTL" functionality is what I need here.

I fear that the complexity of removing HFiles may be caused by the (block)
cache, which may still hold their data. Is that right? I'm actually OK with
HBase returning data from files I "deleted" by removing HFiles: I will
specify a timerange on scans anyway (in this example, to omit things older
than 1 week).

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Solr - Lucene - Hadoop - HBase

On Mon, Jul 9, 2012 at 3:44 PM, Jonathan Hsieh <jo...@cloudera.com> wrote:

> You could set your ttls and trigger a major compaction ...
>
> Or, (this is pretty advanced) you can probably do it without taking down
> RS's by:
> 1) closing the region in the hbase shell
> 2) deleting the file in the shell
> 3) reopening the region in the hbase shell
>
> Jon.
>
> On Mon, Jul 9, 2012 at 12:41 PM, Alex Baranau <alex.baranov.v@gmail.com
> >wrote:
>
> > Heh, this is what I want to avoid actually: restarting RSs.
> >
> > Alex Baranau
> > ------
> > Sematext :: http://blog.sematext.com/ :: Solr - Lucene - Hadoop - HBase
> >
> > On Mon, Jul 9, 2012 at 3:38 PM, Amandeep Khurana <am...@gmail.com>
> wrote:
> >
> > > I _think_ you should be able to do it and be just fine but you'll need to
> > > shut down the region servers before you remove and start them back up after
> > > you are done. Someone else closer to the internals can confirm/deny this.
> > >
> > >
> > > On Monday, July 9, 2012 at 12:36 PM, Alex Baranau wrote:
> > >
> > > > Hello,
> > > >
> > > > I wonder, for purging old data, if I'm OK with "remove all StoreFiles
> > > which
> > > > are older than ..." way, can I do that? To me it seems like this can
> > be a
> > > > very effective way to remove old data, similar to fast bulk import
> > > > functionality, but for deletion.
> > > >
> > > > Thank you,
> > > >
> > > > Alex Baranau
> > > > ------
> > > > Sematext :: http://blog.sematext.com/ :: Solr - Lucene - Hadoop -
> > HBase
> > > >
> > > >
> > >
> > >
> > >
> >
>
>
>
> --
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // jon@cloudera.com
>

Re: Can I manually remove HFiles (similar to bulk import, but bulk remove)?

Posted by Jonathan Hsieh <jo...@cloudera.com>.
You could set your TTLs and trigger a major compaction ...

Or, (this is pretty advanced) you can probably do it without taking down
RS's by:
1) closing the region in the hbase shell
2) deleting the file in the shell
3) reopening the region in the hbase shell
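
A rough sketch of those three steps from the command line (region name, HDFS
path, and file name are made-up placeholders; the layout assumed is
/hbase/<table>/<encoded-region>/<cf>/<hfile>, so try it on a test cluster
first):

  # 1) close the region that owns the store file
  echo "close_region 'mytable,20120701,1341864974217.5a3f9c41d8be1e2f4c6d7a8b9c0d1e2f.'" | hbase shell

  # 2) remove the old store file from the region's column family directory in HDFS
  hadoop fs -rm /hbase/mytable/5a3f9c41d8be1e2f4c6d7a8b9c0d1e2f/d/6201936508421573219

  # 3) reopen the region so it comes back up without the removed file
  echo "assign 'mytable,20120701,1341864974217.5a3f9c41d8be1e2f4c6d7a8b9c0d1e2f.'" | hbase shell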

Jon.

On Mon, Jul 9, 2012 at 12:41 PM, Alex Baranau <al...@gmail.com> wrote:

> Heh, this is what I want to avoid actually: restarting RSs.
>
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Solr - Lucene - Hadoop - HBase
>
> On Mon, Jul 9, 2012 at 3:38 PM, Amandeep Khurana <am...@gmail.com> wrote:
>
> > I _think_ you should be able to do it and be just fine but you'll need to
> > shut down the region servers before you remove and start them back up after
> > you are done. Someone else closer to the internals can confirm/deny this.
> >
> >
> > On Monday, July 9, 2012 at 12:36 PM, Alex Baranau wrote:
> >
> > > Hello,
> > >
> > > I wonder, for purging old data, if I'm OK with "remove all StoreFiles
> > which
> > > are older than ..." way, can I do that? To me it seems like this can
> be a
> > > very effective way to remove old data, similar to fast bulk import
> > > functionality, but for deletion.
> > >
> > > Thank you,
> > >
> > > Alex Baranau
> > > ------
> > > Sematext :: http://blog.sematext.com/ :: Solr - Lucene - Hadoop -
> HBase
> > >
> > >
> >
> >
> >
>



-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com

Re: Can I manually remove HFiles (similar to bulk import, but bulk remove)?

Posted by Alex Baranau <al...@gmail.com>.
Heh, this is what I want to avoid actually: restarting RSs.

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Solr - Lucene - Hadoop - HBase

On Mon, Jul 9, 2012 at 3:38 PM, Amandeep Khurana <am...@gmail.com> wrote:

> I _think_ you should be able to do it and be just fine but you'll need to
> shut down the region servers before you remove and start them back up after
> you are done. Someone else closer to the internals can confirm/deny this.
>
>
> On Monday, July 9, 2012 at 12:36 PM, Alex Baranau wrote:
>
> > Hello,
> >
> > I wonder, for purging old data, if I'm OK with "remove all StoreFiles
> which
> > are older than ..." way, can I do that? To me it seems like this can be a
> > very effective way to remove old data, similar to fast bulk import
> > functionality, but for deletion.
> >
> > Thank you,
> >
> > Alex Baranau
> > ------
> > Sematext :: http://blog.sematext.com/ :: Solr - Lucene - Hadoop - HBase
> >
> >
>
>
>

Re: Can I manually remove HFiles (similar to bulk import, but bulk remove)?

Posted by Amandeep Khurana <am...@gmail.com>.
I _think_ you should be able to do it and be just fine, but you'll need to shut down the region servers before you remove the files and start them back up after you are done. Someone else closer to the internals can confirm/deny this.
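
A blunt sketch of that (file paths are made-up placeholders; this assumes the
stock scripts in $HBASE_HOME/bin and takes HBase down entirely for the
duration):

  ./bin/stop-hbase.sh      # stop the cluster, region servers included
  hadoop fs -rm /hbase/mytable/5a3f9c41d8be1e2f4c6d7a8b9c0d1e2f/d/6201936508421573219   # remove the old store file(s)
  ./bin/start-hbase.sh     # start it back up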


On Monday, July 9, 2012 at 12:36 PM, Alex Baranau wrote:

> Hello,
> 
> I wonder, for purging old data, if I'm OK with "remove all StoreFiles which
> are older than ..." way, can I do that? To me it seems like this can be a
> very effective way to remove old data, similar to fast bulk import
> functionality, but for deletion.
> 
> Thank you,
> 
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Solr - Lucene - Hadoop - HBase
> 
>