Posted to dev@hive.apache.org by Thai Bui <bl...@gmail.com> on 2019/04/24 19:03:50 UTC

A proposal for read-only external table for cloud-native Hive deployment

Hello all,

Hive 3 has brought significant changes to the community with support for
ACID tables as the default managed tables. With ACID tables, we can use
features such as materialized views, query result caching for BI tools, and
more. But for non-ACID tables, such as external tables, Hive doesn't support
any of these advanced features, which makes many cloud-native users
like me sad :(.

I propose that we support a limited, read-only flavor of external tables
such that materialized views and query result caching would work.
For example:

CREATE EXTERNAL TABLE table_name (..) STORED AS ORC
LOCATION 's3://some-bucket/some-dir'
TBLPROPERTIES ('read-only'='true');

On such tables, data modification operations such as INSERT and UPDATE
would fail, while DDL operations that add or remove partitions, such as
"ALTER TABLE ... ADD PARTITION", would succeed. This would make it
possible for Hive to invalidate the query result cache and materialized
views even when the table is an external table.
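To make the intended behavior concrete, here is a rough sketch (the
'read-only' behavior shown here is the proposal, not existing Hive behavior;
it assumes the table is also partitioned, say by a ds column, and
staging_table / daily_counts are made-up names):

-- Registering data written by an external tool would still be allowed:
ALTER TABLE table_name ADD PARTITION (ds='2019-04-24')
LOCATION 's3://some-bucket/some-dir/ds=2019-04-24';

-- Any write through Hive would be rejected at compile time:
INSERT INTO table_name SELECT * FROM staging_table;  -- fails: read-only

-- Which in turn would let Hive maintain materialized views over the table:
CREATE MATERIALIZED VIEW daily_counts AS
SELECT ds, count(*) AS cnt FROM table_name GROUP BY ds;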

Let me know what you all think, and maybe I can start writing a wiki
document describing the approach in greater detail.

Thanks,
Thai

Re: A proposal for read-only external table for cloud-native Hive deployment

Posted by Thai Bui <bl...@gmail.com>.
Hello all -- This thread is old, but I wanted to provide an update with
newer information without spamming the dev list.

To recap: my previous discussion proposed read-only transaction support
for Hive on external tables. This could be approximated today using
insert-only tables managed by Hive, which requires a one-time re-ingestion.
However, the recent Delta Lake initiative
<https://delta.io/> from Databricks is gaining traction, and it could serve
as a neutral standard for big data tools to support ACID transactions
natively on the cloud for external tables (not only in Hive). Has the
community considered supporting this option, and what would it take for
Hive to support it?

This has several advantages over the current approach:

1. ACID transaction support becomes possible with external tables, across
different tools.
2. The metadata can be externalized and is therefore more scalable,
supporting millions of partitions (currently, my company's existing
RDS-backed metastores struggle with a few hundred thousand to a million
external partitions per table).

It would be great if Hive could support this project for the aforementioned
reasons, and of course support for the ORC format would be very welcome as
well, since Delta Lake only supports Parquet as of today.

Thai

On Fri, Apr 26, 2019 at 11:31 AM Thai Bui <bl...@gmail.com> wrote:

> My suggestion does require a change to your ETL process, but it doesn't
>> require you to copy the data into HDFS or to create storage clusters.
>> Hive
>> managed tables can reside in S3 with no problem.
>
>
> Thanks for pointing this out. I totally forget that managed tables could
> have a location externally specified. I think we can cope with this
> approach in the short-term but in the long-term, a more ETL-less approach
> is much more preferable with read-only transactional support for external
> tables. Mainly to avoid duplicate copies of data.
>
> This is actually a common ask when it comes to OnPrem -> Cloud REPL
>> streams, to avoid diverging.
>> The replicated data having its own updates is very problematic for CDC
>> style ACID replication into the cloud.
>
>
> It's a common problem when the pattern is replicating data everywhere and
> the users (such as analysts) don't know its full implications, which we are
> trying to avoid in the first place. But sometime, it's unavoidable if you
> are going on-prem -> cloud. With ACID support for read-only tables though,
> we'll give the users an option to "try it out" before fully commit to an
> ETL process to copy/optimize the data.
>
> On Thu, Apr 25, 2019 at 4:54 PM Gopal Vijayaraghavan <go...@apache.org>
> wrote:
>
>> >    reuse the transactional_properties and add 'read_only' as a new
>> value. With
>> >    read-only tables, all INSERT, UPDATE, DELETE statements will fail at
>> Hive
>> >    front-end.
>>
>> This is actually a common ask when it comes to OnPrem -> Cloud REPL
>> streams, to avoid diverging.
>>
>> The replicated data having its own updates is very problematic for CDC
>> style ACID replication into the cloud.
>>
>> Ranger authorization works great for this, though it is all-or-nothing
>> right now.
>>
>> At some point in the future, I wish I could lock up specific fields from
>> being updated in ACID.
>>
>> Cheers,
>> Gopal
>>
>>
>>
>
> --
> Thai
>


-- 
Thai

Re: A proposal for read-only external table for cloud-native Hive deployment

Posted by Thai Bui <bl...@gmail.com>.
>
> My suggestion does require a change to your ETL process, but it doesn't
> require you to copy the data into HDFS or to create storage clusters.  Hive
> managed tables can reside in S3 with no problem.


Thanks for pointing this out. I totally forgot that managed tables can
have an externally specified location. I think we can live with this
approach in the short term, but in the long term an ETL-less approach with
read-only transactional support for external tables would be much
preferable, mainly to avoid duplicate copies of the data.

This is actually a common ask when it comes to OnPrem -> Cloud REPL
> streams, to avoid diverging.
> The replicated data having its own updates is very problematic for CDC
> style ACID replication into the cloud.


It's a common problem when the pattern is to replicate data everywhere and
the users (such as analysts) don't know its full implications, which is what
we are trying to avoid in the first place. But sometimes it's unavoidable
when going on-prem -> cloud. With ACID support for read-only tables, though,
we'd give users an option to "try it out" before fully committing to an
ETL process to copy/optimize the data.

On Thu, Apr 25, 2019 at 4:54 PM Gopal Vijayaraghavan <go...@apache.org>
wrote:

> >    reuse the transactional_properties and add 'read_only' as a new
> value. With
> >    read-only tables, all INSERT, UPDATE, DELETE statements will fail at
> Hive
> >    front-end.
>
> This is actually a common ask when it comes to OnPrem -> Cloud REPL
> streams, to avoid diverging.
>
> The replicated data having its own updates is very problematic for CDC
> style ACID replication into the cloud.
>
> Ranger authorization works great for this, though it is all-or-nothing
> right now.
>
> At some point in the future, I wish I could lock up specific fields from
> being updated in ACID.
>
> Cheers,
> Gopal
>
>
>

-- 
Thai

Re: A proposal for read-only external table for cloud-native Hive deployment

Posted by Gopal Vijayaraghavan <go...@apache.org>.
>    reuse the transactional_properties and add 'read_only' as a new value. With
>    read-only tables, all INSERT, UPDATE, DELETE statements will fail at Hive
>    front-end. 

This is actually a common ask when it comes to OnPrem -> Cloud REPL streams, to avoid divergence.

The replicated data having its own updates is very problematic for CDC-style ACID replication into the cloud.

Ranger authorization works great for this, though it is all-or-nothing right now.

At some point in the future, I wish I could lock specific fields against being updated in ACID tables.
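As a rough illustration of the all-or-nothing, table-level protection that is
possible today (a sketch using Hive's SQL standard-based authorization rather
than Ranger itself; the role and table names are made up):

CREATE ROLE replica_readers;
GRANT SELECT ON TABLE warehouse.events_replica TO ROLE replica_readers;
-- No INSERT/UPDATE/DELETE privileges are granted, so writes by that role
-- fail authorization, but the protection applies to the whole table, not
-- to individual columns.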

Cheers,
Gopal



Re: A proposal for read-only external table for cloud-native Hive deployment

Posted by Alan Gates <al...@gmail.com>.
My suggestion does require a change to your ETL process, but it doesn't
require you to copy the data into HDFS or to create storage clusters.  Hive
managed tables can reside in S3 with no problem.
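For instance, a sketch of what that could look like (the column list and
bucket path are placeholders, and depending on the deployment the location
may instead come from the warehouse directory configured in hive-site.xml):

CREATE TABLE events_managed (id BIGINT, payload STRING)
STORED AS ORC
LOCATION 's3a://some-bucket/warehouse/events_managed'
TBLPROPERTIES ('transactional'='true');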

Alan.

On Thu, Apr 25, 2019 at 2:18 PM Thai Bui <bl...@gmail.com> wrote:

> Your suggested workflow will work and it would require us to re-ETL data
> from S3 to all over the place to multiple clusters. This is a cumbersome
> approach since most of our data reside on S3 and clusters are somewhat
> transient in nature (in the order of a few months for a redeployment &
> don't have large HDFS capacity).
>
> We do scale clusters up and down for compute but not for storage since HDFS
> is not easy to be scaled down on demand. It would be much more preferable
> in this architecture to have Hive behaves as a pure compute engine that can
> be accelerated through query result caching and materialized views.
>
> I'm not that familiar with Hive 3 implementation to know if this feature
> would be simple to make. I was hoping to change only the front-end of Hive
> and keep the ACID back-end implementation intact. For example, we could
> reuse the transactional_properties and add 'read_only' as a new value. With
> read-only tables, all INSERT, UPDATE, DELETE statements will fail at Hive
> front-end. Thus, it ensures that the ACID properties are guaranteed and the
> rest of ACID assumptions on the backend could continue to work. For DDL
> operations, since it has to go through the metastore I think it would
> automatically work with the current ACID code base and the only thing we
> need to do is to enable (where it was disabled) and test it.
>
> On Wed, Apr 24, 2019 at 6:05 PM Alan Gates <al...@gmail.com> wrote:
>
> > Would a workflow like the following work then:
> > 1. Non-Hive tool produces data
> > 2. Do a Hive load into a managed table.  This effectively takes a
> snapshot
> > of the data.
> > 3. Now you still have the data for Non-Hive tools to operate on, and in
> > Hive you get all the Hive 3 goodness.
> >
> > This would introduce an additional copy of the data.  It would be
> > interesting to look at adding a copy on write semantic to a partition to
> > avoid this copy, but you don't need that to get going.
> >
> > I'm not opposed to what you're suggesting, I'm just wondering if there
> are
> > other ways that will save you work and that will keep Hive more simple.
> >
> > Alan.
> >
> > On Wed, Apr 24, 2019 at 2:07 PM Thai Bui <bl...@gmail.com> wrote:
> >
> > > As I understand, read-only ACID tables only work if your table is a
> > managed
> > > table (so you'll have to create your table with CREATE TABLE
> > > .. TBLPROPERTIES ('transactional_properties'='insert_only') ) and Hive
> > will
> > > control the data layout.
> > >
> > > Unfortunately, in my case, I'm concerned with external tables where
> data
> > is
> > > written by other tools such as Spark, PySpark, Sqoop or older Hive
> > clusters
> > > and Hadoop-based systems to cloud storage such as S3. My wish is to
> have
> > > materialized views and query result caching work directly on those data
> > if
> > > and only if the table is registered as an external, read-only table in
> > Hive
> > > 3 via the same ACID mechanism.
> > >
> > > On Wed, Apr 24, 2019 at 3:35 PM Alan Gates <al...@gmail.com>
> wrote:
> > >
> > > > Have you looked at the insert only ACID tables in Hive 3 (
> > > > https://issues.apache.org/jira/browse/HIVE-14535 )?  These were
> > designed
> > > > specifically with the cloud in mind, since the way Hive traditionally
> > > adds
> > > > new data doesn't work well in the cloud.  And they do not require
> ORC,
> > > they
> > > > work with any file format.
> > > >
> > > > Alan.
> > > >
> > > > On Wed, Apr 24, 2019 at 12:04 PM Thai Bui <bl...@gmail.com>
> wrote:
> > > >
> > > > > Hello all,
> > > > >
> > > > > Hive 3 has brought significant changes to the community with the
> > > support
> > > > > for ACID tables as default managed tables. With ACID tables, we can
> > use
> > > > > features such as materialized views, query result caching for BI
> > tools
> > > > and
> > > > > more. But without ACID tables such as external tables, Hive doesn't
> > > > support
> > > > > any of these advanced features which makes a majority of
> cloud-native
> > > > users
> > > > > like me sad :(.
> > > > >
> > > > > I propose we should support a more limited version of read-only
> > > external
> > > > > tables such that materialized views and query result caching would
> > > work.
> > > > > For example:
> > > > >
> > > > > CREATE EXTERNAL TABLE table_name (..) STORED AS ORC
> > > > > LOCATION 's3://some-bucket/some-dir'
> > > > > TBLPROPERTIES ('read-only': "true");
> > > > >
> > > > > In such tables, any data modification operations such as INSERT and
> > > > UPDATE
> > > > > would fail and DDL operations that "add" or "remove" partitions to
> > the
> > > > > table would succeed such as "ALTER TABLE ... ADD PARTITION". This
> > would
> > > > > make it possible for Hive to invalidate the cache and materialized
> > > views
> > > > > even when the table is an external table.
> > > > >
> > > > > Let me know what do you guys think and maybe I can start writing a
> > wiki
> > > > > document describing the approach in greater details.
> > > > >
> > > > > Thanks,
> > > > > Thai
> > > > >
> > > >
> > >
> > >
> > > --
> > > Thai
> > >
> >
>
>
> --
> Thai
>

Re: A proposal for read-only external table for cloud-native Hive deployment

Posted by Thai Bui <bl...@gmail.com>.
Your suggested workflow would work, but it would require us to re-ETL data
from S3 into multiple clusters. That is a cumbersome approach, since most of
our data resides on S3 and our clusters are somewhat transient in nature (on
the order of a few months between redeployments, and without large HDFS
capacity).

We do scale clusters up and down for compute, but not for storage, since
HDFS is not easy to scale down on demand. In this architecture it would be
much preferable to have Hive behave as a pure compute engine that can be
accelerated through query result caching and materialized views.
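For reference, the acceleration features mentioned here are controlled by
existing Hive 3 settings, roughly like the following (a sketch; the view and
table names are made up, and materialized view rewriting generally requires
transactional source tables today, which is exactly what this proposal is
trying to relax):

-- Query result caching (on by default in Hive 3):
SET hive.query.results.cache.enabled=true;

-- Materialized views with automatic query rewriting:
SET hive.materializedview.rewriting=true;
CREATE MATERIALIZED VIEW daily_counts AS
SELECT ds, count(*) AS cnt FROM events GROUP BY ds;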

I'm not familiar enough with the Hive 3 implementation to know whether this
feature would be simple to build. I was hoping to change only the front-end
of Hive and keep the ACID back-end implementation intact. For example, we
could reuse transactional_properties and add 'read_only' as a new value.
With read-only tables, all INSERT, UPDATE, and DELETE statements would fail
at the Hive front-end. This ensures that the ACID properties are preserved
and that the rest of the ACID assumptions in the back-end continue to hold.
For DDL operations, since they go through the metastore, I think they would
automatically work with the current ACID code base, and the only thing we
need to do is enable them (where they were disabled) and test.
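A sketch of what the proposed definition could look like (note that
'read_only' is the new value being proposed here and does not exist in Hive
today; the table and column names are placeholders):

CREATE EXTERNAL TABLE events (id BIGINT, payload STRING)
STORED AS ORC
LOCATION 's3://some-bucket/events'
TBLPROPERTIES ('transactional'='true',
               'transactional_properties'='read_only');  -- proposed value

-- The front-end would then reject writes at compile time:
INSERT INTO events VALUES (1, 'x');  -- would fail: table is read_only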

On Wed, Apr 24, 2019 at 6:05 PM Alan Gates <al...@gmail.com> wrote:

> Would a workflow like the following work then:
> 1. Non-Hive tool produces data
> 2. Do a Hive load into a managed table.  This effectively takes a snapshot
> of the data.
> 3. Now you still have the data for Non-Hive tools to operate on, and in
> Hive you get all the Hive 3 goodness.
>
> This would introduce an additional copy of the data.  It would be
> interesting to look at adding a copy on write semantic to a partition to
> avoid this copy, but you don't need that to get going.
>
> I'm not opposed to what you're suggesting, I'm just wondering if there are
> other ways that will save you work and that will keep Hive more simple.
>
> Alan.
>
> On Wed, Apr 24, 2019 at 2:07 PM Thai Bui <bl...@gmail.com> wrote:
>
> > As I understand, read-only ACID tables only work if your table is a
> managed
> > table (so you'll have to create your table with CREATE TABLE
> > .. TBLPROPERTIES ('transactional_properties'='insert_only') ) and Hive
> will
> > control the data layout.
> >
> > Unfortunately, in my case, I'm concerned with external tables where data
> is
> > written by other tools such as Spark, PySpark, Sqoop or older Hive
> clusters
> > and Hadoop-based systems to cloud storage such as S3. My wish is to have
> > materialized views and query result caching work directly on those data
> if
> > and only if the table is registered as an external, read-only table in
> Hive
> > 3 via the same ACID mechanism.
> >
> > On Wed, Apr 24, 2019 at 3:35 PM Alan Gates <al...@gmail.com> wrote:
> >
> > > Have you looked at the insert only ACID tables in Hive 3 (
> > > https://issues.apache.org/jira/browse/HIVE-14535 )?  These were
> designed
> > > specifically with the cloud in mind, since the way Hive traditionally
> > adds
> > > new data doesn't work well in the cloud.  And they do not require ORC,
> > they
> > > work with any file format.
> > >
> > > Alan.
> > >
> > > On Wed, Apr 24, 2019 at 12:04 PM Thai Bui <bl...@gmail.com> wrote:
> > >
> > > > Hello all,
> > > >
> > > > Hive 3 has brought significant changes to the community with the
> > support
> > > > for ACID tables as default managed tables. With ACID tables, we can
> use
> > > > features such as materialized views, query result caching for BI
> tools
> > > and
> > > > more. But without ACID tables such as external tables, Hive doesn't
> > > support
> > > > any of these advanced features which makes a majority of cloud-native
> > > users
> > > > like me sad :(.
> > > >
> > > > I propose we should support a more limited version of read-only
> > external
> > > > tables such that materialized views and query result caching would
> > work.
> > > > For example:
> > > >
> > > > CREATE EXTERNAL TABLE table_name (..) STORED AS ORC
> > > > LOCATION 's3://some-bucket/some-dir'
> > > > TBLPROPERTIES ('read-only': "true");
> > > >
> > > > In such tables, any data modification operations such as INSERT and
> > > UPDATE
> > > > would fail and DDL operations that "add" or "remove" partitions to
> the
> > > > table would succeed such as "ALTER TABLE ... ADD PARTITION". This
> would
> > > > make it possible for Hive to invalidate the cache and materialized
> > views
> > > > even when the table is an external table.
> > > >
> > > > Let me know what do you guys think and maybe I can start writing a
> wiki
> > > > document describing the approach in greater details.
> > > >
> > > > Thanks,
> > > > Thai
> > > >
> > >
> >
> >
> > --
> > Thai
> >
>


-- 
Thai

Re: A proposal for read-only external table for cloud-native Hive deployment

Posted by Alan Gates <al...@gmail.com>.
Would a workflow like the following work then:
1. Non-Hive tool produces data
2. Do a Hive load into a managed table.  This effectively takes a snapshot
of the data.
3. Now you still have the data for Non-Hive tools to operate on, and in
Hive you get all the Hive 3 goodness.

This would introduce an additional copy of the data.  It would be
interesting to look at adding a copy-on-write semantic to a partition to
avoid this copy, but you don't need that to get going.
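A rough sketch of step 2, assuming the externally produced data is already
registered as an external table (events_external, events_snapshot and the
columns are made-up names):

-- Managed, transactional table that will hold the snapshot:
CREATE TABLE events_snapshot (id BIGINT, payload STRING)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Take the snapshot by reading straight from the external table:
INSERT INTO events_snapshot SELECT id, payload FROM events_external;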

I'm not opposed to what you're suggesting, I'm just wondering if there are
other ways that will save you work and that will keep Hive more simple.

Alan.

On Wed, Apr 24, 2019 at 2:07 PM Thai Bui <bl...@gmail.com> wrote:

> As I understand, read-only ACID tables only work if your table is a managed
> table (so you'll have to create your table with CREATE TABLE
> .. TBLPROPERTIES ('transactional_properties'='insert_only') ) and Hive will
> control the data layout.
>
> Unfortunately, in my case, I'm concerned with external tables where data is
> written by other tools such as Spark, PySpark, Sqoop or older Hive clusters
> and Hadoop-based systems to cloud storage such as S3. My wish is to have
> materialized views and query result caching work directly on those data if
> and only if the table is registered as an external, read-only table in Hive
> 3 via the same ACID mechanism.
>
> On Wed, Apr 24, 2019 at 3:35 PM Alan Gates <al...@gmail.com> wrote:
>
> > Have you looked at the insert only ACID tables in Hive 3 (
> > https://issues.apache.org/jira/browse/HIVE-14535 )?  These were designed
> > specifically with the cloud in mind, since the way Hive traditionally
> adds
> > new data doesn't work well in the cloud.  And they do not require ORC,
> they
> > work with any file format.
> >
> > Alan.
> >
> > On Wed, Apr 24, 2019 at 12:04 PM Thai Bui <bl...@gmail.com> wrote:
> >
> > > Hello all,
> > >
> > > Hive 3 has brought significant changes to the community with the
> support
> > > for ACID tables as default managed tables. With ACID tables, we can use
> > > features such as materialized views, query result caching for BI tools
> > and
> > > more. But without ACID tables such as external tables, Hive doesn't
> > support
> > > any of these advanced features which makes a majority of cloud-native
> > users
> > > like me sad :(.
> > >
> > > I propose we should support a more limited version of read-only
> external
> > > tables such that materialized views and query result caching would
> work.
> > > For example:
> > >
> > > CREATE EXTERNAL TABLE table_name (..) STORED AS ORC
> > > LOCATION 's3://some-bucket/some-dir'
> > > TBLPROPERTIES ('read-only': "true");
> > >
> > > In such tables, any data modification operations such as INSERT and
> > UPDATE
> > > would fail and DDL operations that "add" or "remove" partitions to the
> > > table would succeed such as "ALTER TABLE ... ADD PARTITION". This would
> > > make it possible for Hive to invalidate the cache and materialized
> views
> > > even when the table is an external table.
> > >
> > > Let me know what do you guys think and maybe I can start writing a wiki
> > > document describing the approach in greater details.
> > >
> > > Thanks,
> > > Thai
> > >
> >
>
>
> --
> Thai
>

Re: A proposal for read-only external table for cloud-native Hive deployment

Posted by Thai Bui <bl...@gmail.com>.
As I understand it, the insert-only ACID tables only work if the table is a
managed table (so you'll have to create the table with CREATE TABLE
.. TBLPROPERTIES ('transactional_properties'='insert_only')), and Hive will
control the data layout.
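Spelled out, that would look roughly like the following (the table and
column names are placeholders; note that 'transactional'='true' is normally
set alongside 'transactional_properties'):

CREATE TABLE events_mm (id BIGINT, payload STRING)
STORED AS ORC
TBLPROPERTIES ('transactional'='true',
               'transactional_properties'='insert_only');

-- INSERT works, but UPDATE and DELETE are rejected on insert-only tables:
INSERT INTO events_mm VALUES (1, 'x');
UPDATE events_mm SET payload = 'y' WHERE id = 1;  -- fails on insert-only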

Unfortunately, in my case I'm concerned with external tables where data is
written by other tools such as Spark, PySpark, Sqoop, or older Hive clusters
and Hadoop-based systems to cloud storage such as S3. My wish is to have
materialized views and query result caching work directly on that data if
and only if the table is registered as an external, read-only table in Hive
3 via the same ACID mechanism.

On Wed, Apr 24, 2019 at 3:35 PM Alan Gates <al...@gmail.com> wrote:

> Have you looked at the insert only ACID tables in Hive 3 (
> https://issues.apache.org/jira/browse/HIVE-14535 )?  These were designed
> specifically with the cloud in mind, since the way Hive traditionally adds
> new data doesn't work well in the cloud.  And they do not require ORC, they
> work with any file format.
>
> Alan.
>
> On Wed, Apr 24, 2019 at 12:04 PM Thai Bui <bl...@gmail.com> wrote:
>
> > Hello all,
> >
> > Hive 3 has brought significant changes to the community with the support
> > for ACID tables as default managed tables. With ACID tables, we can use
> > features such as materialized views, query result caching for BI tools
> and
> > more. But without ACID tables such as external tables, Hive doesn't
> support
> > any of these advanced features which makes a majority of cloud-native
> users
> > like me sad :(.
> >
> > I propose we should support a more limited version of read-only external
> > tables such that materialized views and query result caching would work.
> > For example:
> >
> > CREATE EXTERNAL TABLE table_name (..) STORED AS ORC
> > LOCATION 's3://some-bucket/some-dir'
> > TBLPROPERTIES ('read-only': "true");
> >
> > In such tables, any data modification operations such as INSERT and
> UPDATE
> > would fail and DDL operations that "add" or "remove" partitions to the
> > table would succeed such as "ALTER TABLE ... ADD PARTITION". This would
> > make it possible for Hive to invalidate the cache and materialized views
> > even when the table is an external table.
> >
> > Let me know what do you guys think and maybe I can start writing a wiki
> > document describing the approach in greater details.
> >
> > Thanks,
> > Thai
> >
>


-- 
Thai

Re: A proposal for read-only external table for cloud-native Hive deployment

Posted by Alan Gates <al...@gmail.com>.
Have you looked at the insert-only ACID tables in Hive 3 (
https://issues.apache.org/jira/browse/HIVE-14535 )?  These were designed
specifically with the cloud in mind, since the way Hive traditionally adds
new data doesn't work well in the cloud.  And they do not require ORC; they
work with any file format.
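For example, something like this should be possible (the table name and
columns are placeholders):

CREATE TABLE events_mm_parquet (id BIGINT, payload STRING)
STORED AS PARQUET
TBLPROPERTIES ('transactional'='true',
               'transactional_properties'='insert_only');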

Alan.

On Wed, Apr 24, 2019 at 12:04 PM Thai Bui <bl...@gmail.com> wrote:

> Hello all,
>
> Hive 3 has brought significant changes to the community with the support
> for ACID tables as default managed tables. With ACID tables, we can use
> features such as materialized views, query result caching for BI tools and
> more. But without ACID tables such as external tables, Hive doesn't support
> any of these advanced features which makes a majority of cloud-native users
> like me sad :(.
>
> I propose we should support a more limited version of read-only external
> tables such that materialized views and query result caching would work.
> For example:
>
> CREATE EXTERNAL TABLE table_name (..) STORED AS ORC
> LOCATION 's3://some-bucket/some-dir'
> TBLPROPERTIES ('read-only': "true");
>
> In such tables, any data modification operations such as INSERT and UPDATE
> would fail and DDL operations that "add" or "remove" partitions to the
> table would succeed such as "ALTER TABLE ... ADD PARTITION". This would
> make it possible for Hive to invalidate the cache and materialized views
> even when the table is an external table.
>
> Let me know what do you guys think and maybe I can start writing a wiki
> document describing the approach in greater details.
>
> Thanks,
> Thai
>