Posted to user@kudu.apache.org by Boris Tyukin <bo...@boristyukin.com> on 2018/07/25 19:56:10 UTC

Re: swap data in Kudu table

Hi guys,

thanks again for your help!  I just blogged about this
https://boristyukin.com/how-to-hot-swap-apache-kudu-tables-with-apache-impala/

BTW I did not have to invalidate or refresh metadata - it just worked with
the ALTER TABLE TBLPROPERTIES idea. We have only one Kudu master on our dev
cluster, so I am not sure if that is the reason, but the Impala/Kudu docs also
do not mention anything about a metadata refresh. It looks like Impala keeps a
reference to the UUID of the Kudu table, not its actual name.
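
For reference, here is a minimal sketch of the repointing step, written as a
small Python helper that only builds the Impala statements (it does not
connect to a cluster). The table names are hypothetical, and it assumes the
production table is an external Impala table mapped to Kudu through the
kudu.table_name property, as suggested earlier in this thread. The REFRESH
statement is optional here because I did not need it on our dev cluster; your
setup may differ.

```python
from typing import List


def swap_statements(external_table: str, new_kudu_table: str,
                    refresh: bool = False) -> List[str]:
    """Build Impala statements that repoint an external table at another Kudu table.

    external_table: Impala name of the external table that users query
                    (hypothetical, e.g. "prod.sales").
    new_kudu_table: name of the freshly loaded Kudu table as Kudu knows it
                    (hypothetical, e.g. "impala::staging.sales_new").
    refresh:        also emit a REFRESH for the external table.
    """
    # Very basic guard: these names are pasted into SQL text, so reject
    # characters that could break out of the statement.
    for name in (external_table, new_kudu_table):
        if not name or any(ch in name for ch in "';\n"):
            raise ValueError(f"unsafe table name: {name!r}")

    statements = [
        f"ALTER TABLE {external_table} "
        f"SET TBLPROPERTIES('kudu.table_name' = '{new_kudu_table}')"
    ]
    if refresh:
        statements.append(f"REFRESH {external_table}")
    return statements


if __name__ == "__main__":
    for stmt in swap_statements("prod.sales", "impala::staging.sales_new",
                                refresh=True):
        print(stmt + ";")
```

The statements can then be run through impala-shell or any Impala client.
Because only the table property changes, the Impala-side table that users
query stays in place, which is the point of doing the swap this way.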

One thing I am still puzzled by is how Impala was able to finish my
long-running SELECT statement, which I had kicked off right before the swap.
I did not get any error messages and I could clearly see that the Kudu tables
were getting renamed and dropped while the query was still running in a
different session; it completed 10 seconds after the swap. This is still a
mystery to me. The only explanation I have is that the data was already in
the Impala daemons' memory and the query did not need the Kudu tables at that
point.

Boris



On Fri, Feb 23, 2018 at 5:13 PM Boris Tyukin <bo...@boristyukin.com> wrote:

> you guys are awesome, thanks!
>
> Todd, I like the ALTER TABLE TBLPROPERTIES idea - will test it next week.
> Views might work as well, but for a number of reasons I want to keep them
> as my last resort :)
>
> On Fri, Feb 23, 2018 at 4:32 PM, Todd Lipcon <to...@cloudera.com> wrote:
>
>> A couple other ideas from the Impala side:
>>
>> - could you use a view and alter the view to point to a different table?
>> Then all readers would be pointed at the view, and security permissions
>> could be on that view rather than the underlying tables?
>>
>> - I think if you use an external table in Impala you could use an ALTER
>> TABLE TBLPROPERTIES ... statement to change kudu.table_name to point to a
>> different table. Then issue a 'refresh' on the impalads so that they load
>> the new metadata. Subsequent queries would hit the new underlying Kudu
>> table, but permissions and stats would be unchanged.
>>
>> -Todd
>>
>> On Fri, Feb 23, 2018 at 1:16 PM, Mike Percy <mp...@apache.org> wrote:
>>
>>> Hi Boris, those are good ideas. Currently Kudu does not have atomic bulk
>>> load capabilities or staging abilities. Theoretically renaming a partition
>>> atomically shouldn't be that hard to implement, since it's just a master
>>> metadata operation which can be done atomically, but it's not yet
>>> implemented.
>>>
>>> There is a JIRA to track a generic bulk load API here:
>>> https://issues.apache.org/jira/browse/KUDU-1370
>>>
>>> Since I couldn't find anything to track the specific features you
>>> mentioned, I just filed the following improvement JIRAs so we can track them:
>>>
>>>    - KUDU-2326: Support atomic bulk load operation
>>>    <https://issues.apache.org/jira/browse/KUDU-2326>
>>>    - KUDU-2327: Support atomic swap of tables or partitions
>>>    <https://issues.apache.org/jira/browse/KUDU-2327>
>>>
>>> Mike
>>>
>>> On Thu, Feb 22, 2018 at 6:39 AM, Boris Tyukin <bo...@boristyukin.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> I am trying to figure out the best and safest way to swap data in a
>>>> production Kudu table with data from a staging table.
>>>>
>>>> Basically, once in a while we need to perform a full reload of some
>>>> tables (once in a few months). These tables are pretty large with billions
>>>> of rows and we want to minimize the risk and downtime for users if
>>>> something bad happens in the middle of that process.
>>>>
>>>> With Hive and Impala on HDFS, we can use a very handy command, LOAD
>>>> DATA INPATH. We can prepare data for the reload in a staging table upfront,
>>>> and this process might take many hours. Once the staging table is ready, we
>>>> can issue a LOAD DATA INPATH command, which will move the underlying HDFS
>>>> files to the production table - this operation is almost instant and is the
>>>> very last step in our pipeline.
>>>>
>>>> Alternatively, we can swap partitions using the ALTER TABLE EXCHANGE
>>>> PARTITION command.
>>>>
>>>> Now with Kudu, I cannot seem to find a good strategy. The only thing that
>>>> came to my mind is to drop the production table and rename the staging
>>>> table to the production table as the last step of the job, but in this case
>>>> we are going to lose statistics and security permissions.
>>>>
>>>> Any other ideas?
>>>>
>>>> Thanks!
>>>> Boris
>>>>
>>>
>>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>

Re: swap data in Kudu table

Posted by Boris <bo...@gmail.com>.
Thanks so much Tomas, glad you liked it. But as you might have already seen in
another thread, the workaround I've described won't work with Impala 2.12
due to a breaking change.

On Thu, Aug 2, 2018, 07:18 farkas@tf-bic.sk <fa...@tf-bic.sk> wrote:

> Thanks Boris for a great article!
> Tomas

Re: swap data in Kudu table

Posted by fa...@tf-bic.sk, fa...@tf-bic.sk.
Thanks Boris for a great article!
Tomas
