Posted to user@hbase.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2016/02/15 01:18:59 UTC

Rename tables or swap alias

I use Spark to take an old table and clean it up, producing an RDD of cleaned data. What I’d like to do is write all of the data to a new table in HBase, then rename the new table to the old name. If possible, this could be done by changing an alias to point to the new table, as long as all external code uses the alias, or by a two-table rename operation. But I don’t see how to do this in HBase. I am dealing with a lot of data, so I don’t want to modify the table in place with deletes and upserts; that would be incredibly slow. Furthermore, I don’t want to disable the table for more than a tiny span of time.

Is it possible to have 2 tables and rename both in an atomic action, or to change some alias to point to the new table in an atomic action? If not, what is the quickest way to achieve this while minimizing the time the table is disabled?

Re: Rename tables or swap alias

Posted by Pat Ferrel <pa...@occamsmachete.com>.

We implemented this by upserting changed elements and dropping others. On a given cluster it takes 4.5 hours to load HBase, while the trim and cleanup as currently implemented takes 4 days. Back to the drawing board.

I’ve read the references but still don’t grok what to do. I have a table with an event stream, containing duplicates and expired data. I’d like to find the most time-efficient way to remove duplicates and drop expired data from what I’ll call the main_table. This is being queried and added to all the time.

My first thought was to create a new clean_table with Spark by reading main_table, processing it, and writing clean_table, then renaming main_table to old_table and clean_table to main_table. I could then drop old_table. Ignoring what happens to events during renaming, this would be efficient because it would be equivalent to a load, with no complex updates to tables in place and under load.
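The cleanup step described above (remove duplicates, drop expired data) can be sketched in plain Python. This is only an illustration of the transform, not the Spark job from this thread: the `Event` fields, the "newest event per key wins" rule, and the `ttl_seconds` parameter are all assumptions, since the real schema and expiry rule are not given here.

```python
from typing import Dict, Iterable, List, NamedTuple


class Event(NamedTuple):
    # Hypothetical event shape; the thread does not describe the real schema.
    key: str        # identity used to detect duplicates
    timestamp: int  # event time, in seconds
    payload: str


def clean_events(events: Iterable[Event], now: int, ttl_seconds: int) -> List[Event]:
    """Drop expired events, then keep only the newest event for each key."""
    newest: Dict[str, Event] = {}
    for event in events:
        if now - event.timestamp > ttl_seconds:
            continue  # expired: never written to clean_table
        kept = newest.get(event.key)
        if kept is None or event.timestamp > kept.timestamp:
            newest[event.key] = event
    # Sort by key so the output order is deterministic.
    return sorted(newest.values(), key=lambda e: e.key)


raw = [
    Event("a", 100, "x"),
    Event("a", 150, "y"),   # duplicate key, newer than the first "a"
    Event("b", 10, "old"),  # older than the TTL
    Event("c", 120, "z"),
]
cleaned = clean_events(raw, now=200, ttl_seconds=100)
```

Because the output is built from scratch rather than by deleting rows in place, writing it out is a plain load into clean_table, which is the property the paragraph above is after.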

Snapshots and clones seem to miss the issue, which is writing the cleaned data to some place that can then act like main_table, but clearly I don’t understand snapshots and clones. They seem to be some way to alias a table so only changes are logged, without actually copying the data. I’m not sure I care about copying the data into an RDD, which will then undergo some transforms into a final RDD. This can be written efficiently into clean_table with no upserts or dropping of elements, which seems to be what causes things to slow to a halt.

So assuming I have clean_table, how do I get all queries to go to it instead of main_table? Elasticsearch has an alias that I can just point somewhere new. Do I need to keep track of something like this outside of HBase and change it after creating clean_table, or am I missing how to do this with snapshots and clones?
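For what it's worth, the "keep track of it outside of HBase" option can be sketched as a tiny alias holder. This is a single-process illustration only: the class name `TableAlias` and the table names `events_v1` and `events_v2` are made up, and in a real deployment the mapping would have to live somewhere every client can read (a shared config store, for example), which is an assumption rather than anything HBase provides.

```python
import threading


class TableAlias:
    """Maps a logical table name to the physical table currently serving it."""

    def __init__(self, physical_name: str) -> None:
        self._lock = threading.Lock()
        self._physical_name = physical_name

    def resolve(self) -> str:
        # Readers call this before each query instead of hard-coding a name.
        with self._lock:
            return self._physical_name

    def swap(self, new_physical_name: str) -> str:
        """Point the alias at a new table and return the previous table name."""
        with self._lock:
            previous = self._physical_name
            self._physical_name = new_physical_name
            return previous


# Hypothetical names: events_v1 is the live table, events_v2 the cleaned copy.
main_table = TableAlias("events_v1")
retired = main_table.swap("events_v2")  # events_v1 can be dropped later
```

The swap is atomic only with respect to callers of `resolve()` in the same process; queries already running against the old table are not affected by it.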

From: Ted Yu <yuzhihong@gmail.com>
Subject: Re: Rename tables or swap alias
Date: February 16, 2016 at 6:48:53 AM PST
To: user@hbase.apache.org
Reply-To: user@hbase.apache.org

Please see http://hbase.apache.org/book.html#ops.snapshots for background
on snapshots.

In Anil's description, table_old is the result of cloning the snapshot
which is taken in step #1. See
http://hbase.apache.org/book.html#ops.snapshots.clone

Cheers

On Tue, Feb 16, 2016 at 6:35 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> I think I can work out the algorithm if I knew precisely what a “snapshot”
> does. From my reading it seems to be a lightweight fast alias (for lack of
> a better word) since it creates something that refers to the same physical
> data. So if I create a new table with cleaned data, call it table_new. Then
> I drop table_old and “snapshot” table_new into table_old? Is this what is
> suggested?
>
> This leaves me with a small time where there is no table_old, which is the
> time between dropping table_old and creating a snapshot. Is it feasible to
> lock the DB for this time?
>
>> On Feb 15, 2016, at 7:13 PM, Ted Yu <yu...@gmail.com> wrote:
>>
>> Keep in mind that if the writes to this table are not paused, there would
>> be some data coming in between steps #1 and #2 which would not be in the
>> snapshot.
>>
>> Cheers
>>
>> On Mon, Feb 15, 2016 at 6:21 PM, Anil Gupta <an...@gmail.com> wrote:
>>
>>> I dont think there is any atomic operations in hbase to support ddl across
>>> 2 tables.
>>>
>>> But, maybe you can use hbase snapshots.
>>> 1.Create a hbase snapshot.
>>> 2.Truncate the table.
>>> 3.Write data to the table.
>>> 4.Create a table from snapshot taken in step #1 as table_old.
>>>
>>> Now you have two tables. One with current run data and other with last run
>>> data.
>>> I think above process will suffice. But, keep in mind that it is not
>>> atomic.
>>>
>>> HTH,
>>> Anil
>>> Sent from my iPhone
>>>
>>>> On Feb 15, 2016, at 4:25 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>>>>
>>>> Any other way to do what I was asking. With Spark this is a very normal
>>>> thing to treat a table as immutable and create another to replace the old.
>>>>
>>>> Can you lock two tables and rename them in 2 actions then unlock in a
>>>> very short period of time?
>>>>
>>>> Or an alias for table names?
>>>>
>>>> Didn’t see these in any docs or Googling, any help is appreciated.
>>>> Writing all this data back to the original table would be a huge load on a
>>>> table being written to by external processes and therefore under large load
>>>> to begin with.
>>>>
>>>>> On Feb 14, 2016, at 5:03 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>>
>>>>> There is currently no native support for renaming two tables in one atomic
>>>>> action.
>>>>>
>>>>> FYI
>>>>>
>>>>>> On Sun, Feb 14, 2016 at 4:18 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>>>>>>
>>>>>> I use Spark to take an old table, clean it up to create an RDD of cleaned
>>>>>> data. What I’d like to do is write all of the data to a new table in HBase,
>>>>>> then rename the table to the old name. If possible it could be done by
>>>>>> changing an alias to point to the new table as long as all external code
>>>>>> uses the alias, or by a 2 table rename operation. But I don’t see how to do
>>>>>> this for HBase. I am dealing with a lot of data so don’t want to do table
>>>>>> modifications with deletes and upserts, this would be incredibly slow.
>>>>>> Furthermore I don’t want to disable the table for more than a tiny span of
>>>>>> time.
>>>>>>
>>>>>> Is it possible to have 2 tables and rename both in an atomic action, or
>>>>>> change some alias to point to the new table in an atomic action. If not
>>>>>> what is the quickest way to achieve this to minimize time disabled.

Re: Rename tables or swap alias

Posted by Ted Yu <yu...@gmail.com>.

Please see http://hbase.apache.org/book.html#ops.snapshots for background
on snapshots.

In Anil's description, table_old is the result of cloning the snapshot
which is taken in step #1. See
http://hbase.apache.org/book.html#ops.snapshots.clone

Cheers

Re: Rename tables or swap alias

Posted by Pat Ferrel <pa...@occamsmachete.com>.

I think I could work out the algorithm if I knew precisely what a “snapshot” does. From my reading it seems to be a lightweight, fast alias (for lack of a better word), since it creates something that refers to the same physical data. So if I create a new table with cleaned data, call it table_new, do I then drop table_old and “snapshot” table_new into table_old? Is this what is suggested?

This leaves me with a small window where there is no table_old: the time between dropping table_old and creating the snapshot. Is it feasible to lock the DB for this time?

Re: Rename tables or swap alias

Posted by Ted Yu <yu...@gmail.com>.

Keep in mind that if the writes to this table are not paused, there would
be some data coming in between steps #1 and #2 which would not be in the
snapshot.

Cheers
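To make the caveat concrete, here is a toy model of steps #1 and #2 with a writer that is not paused. Plain Python lists stand in for the table and the snapshot, so this says nothing about how HBase implements either one; the function name and row values are invented for illustration.

```python
from typing import List, Tuple


def run_snapshot_then_truncate(table: List[str],
                               writes_during_gap: List[str]) -> Tuple[List[str], List[str]]:
    """Simulate step #1 (snapshot) and step #2 (truncate) with writes in between."""
    snapshot = list(table)            # step 1: snapshot captures the current rows
    table.extend(writes_during_gap)   # writers keep going between the two steps
    # Rows that reached the table after the snapshot was taken.
    lost = [row for row in table if row not in snapshot]
    table.clear()                     # step 2: truncate discards everything
    return snapshot, lost


table = ["row1", "row2"]
snapshot, lost = run_snapshot_then_truncate(table, ["row3"])
```

In this model `row3` ends up in neither the snapshot nor the truncated table, which is why the writes would need to be paused, or replayed afterwards, for the procedure to be lossless.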

Re: Rename tables or swap alias

Posted by Anil Gupta <an...@gmail.com>.

I don't think there are any atomic operations in HBase to support DDL across 2 tables.

But maybe you can use HBase snapshots:
1. Create an HBase snapshot.
2. Truncate the table.
3. Write data to the table.
4. Create a table from the snapshot taken in step #1 as table_old.

Now you have two tables: one with the current run's data and the other with the last run's data.
I think the above process will suffice. But keep in mind that it is not atomic.

HTH,
Anil
Sent from my iPhone
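The four steps above map roughly onto the HBase shell commands `snapshot`, `truncate`, and `clone_snapshot`. The sketch below only builds the command text so it can be reviewed before anyone runs it; the function name is made up, and `main_table`, `main_table_snap`, and `table_old` are placeholder names. Step 3 is left as a comment because the load itself (from Spark, for example) happens outside the shell.

```python
from typing import List


def snapshot_swap_commands(table: str, snapshot_name: str, old_copy: str) -> List[str]:
    """Return HBase shell lines for the four steps, in order."""
    return [
        f"snapshot '{table}', '{snapshot_name}'",           # step 1
        f"truncate '{table}'",                              # step 2
        "# step 3: load the cleaned data into the table",   # done outside the shell
        f"clone_snapshot '{snapshot_name}', '{old_copy}'",  # step 4
    ]


commands = snapshot_swap_commands("main_table", "main_table_snap", "table_old")
for line in commands:
    print(line)
```

Note that `truncate` in the shell disables, drops, and recreates the table, so the table is briefly unavailable during step 2, and the sequence as a whole is still not atomic.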

Re: Rename tables or swap alias

Posted by Pat Ferrel <pa...@occamsmachete.com>.

Is there any other way to do what I was asking? With Spark it is very normal to treat a table as immutable and create another to replace the old one.

Can you lock two tables and rename them in 2 actions, then unlock, all in a very short period of time?

Or an alias for table names?

I didn’t see these in any docs or by Googling; any help is appreciated. Writing all this data back to the original table would put a huge load on a table that is being written to by external processes and is therefore under heavy load to begin with.

Re: Rename tables or swap alias

Posted by Ted Yu <yu...@gmail.com>.

There is currently no native support for renaming two tables in one atomic
action.

FYI
