Posted to user@hive.apache.org by Marko Bauhardt <mb...@datameer.com> on 2019/06/21 15:00:14 UTC

Hive3 Managed Tables

Hi all,
I have a question about Hive3 Managed Tables and how they should be used
in a production environment, let's say in an enterprise environment.

As far as I understand, managed tables have a helpful set of features.
See https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/using-hiveql/content/hive_hive_3_tables.html
So I see many reasons to use managed tables instead of external tables.

The Hive documentation says that the data of managed tables is completely
managed by Hive. That means the managed table space (HDFS path) is owned
by the user `hive`, and only the owner has `rwx` on this path. No one
else. So using `beeline` as a user other than `hive`, or even as
`hive` but with impersonation/proxy-user, does not give me access to the data
via a select statement.
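
To make the permission argument above concrete, here is a minimal sketch of
a POSIX-style owner/group/other check. It is a toy model only: it ignores
the HDFS superuser and extended ACLs, and the names used (`can_access`, the
group `hadoop`, the mode `0700`) are assumptions for illustration, not values
taken from a real cluster.

```python
def can_access(user, user_groups, owner, group, mode, want):
    """Toy POSIX-style check: does `user` hold every permission in `want`?

    `mode` is an octal permission value such as 0o700 and `want` is a string
    made of 'r', 'w' and 'x'. Superuser and extended ACLs are ignored.
    """
    if user == owner:
        shift = 6          # owner bits
    elif group in user_groups:
        shift = 3          # group bits
    else:
        shift = 0          # "other" bits
    bits = (mode >> shift) & 0o7
    needed = {"r": 4, "w": 2, "x": 1}
    return all(bits & needed[c] for c in want)


# Hypothetical managed-table directory: owner hive, group hadoop, mode 0700.
print(can_access("hive", {"hadoop"}, "hive", "hadoop", 0o700, "rx"))
print(can_access("marko", {"users"}, "hive", "hadoop", 0o700, "rx"))
```

With mode `0700`, only the owner passes the check, which matches the
behaviour described above for any user other than `hive`.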

In an enterprise environment impersonation plays an important role. To
allow access to the data, `ranger` (in HDP) comes into play.
Is my assumption correct that `ranger` should be used to set ACLs that
allow a set of groups/users access to the path of specific *managed* tables?

Second question...
If ranger opens the door to the data, I'm able to read the data directly
from HDFS, let's say with a third-party tool. But I believe this is
not a good option given how Hive works with
transactional tables. See
https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/using-hiveql/content/hive_3_internals.html
What I mean is the usage of deltas/buckets etc. Do you agree that direct
access to the HDFS files in the managed table space is not recommended?
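
As a rough illustration of the concern about deltas, the sketch below
classifies directory names in the style commonly seen under a transactional
table (`base_...`, `delta_...`, `delete_delta_...`). The naming patterns and
the sample listing are assumptions for illustration; the helper `classify`
is hypothetical and not part of Hive.

```python
import re

# Assumed naming patterns; an optional "_v<digits>" suffix is tolerated.
_BASE = re.compile(r"^base_(\d+)(?:_v\d+)?$")
_DELTA = re.compile(r"^(delete_delta|delta)_(\d+)_(\d+)(?:_(\d+))?(?:_v\d+)?$")


def classify(dirname):
    """Return (kind, min_write_id, max_write_id) for a directory name."""
    m = _BASE.match(dirname)
    if m:
        write_id = int(m.group(1))
        return ("base", write_id, write_id)
    m = _DELTA.match(dirname)
    if m:
        return (m.group(1), int(m.group(2)), int(m.group(3)))
    return ("other", None, None)


# Hypothetical listing of one table directory.
listing = [
    "base_0000005",
    "delta_0000006_0000006_0000",
    "delete_delta_0000007_0000007_0000",
]
for name in listing:
    print(name, classify(name))
```

A tool that simply reads every file below the table path would mix base
rows, inserted rows, and delete markers, without knowing which writes are
committed. Hive resolves this at read time; a third-party reader would have
to reimplement that logic.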

Thanks,
Marko

Re: Hive3 Managed Tables

Posted by Marko Bauhardt <mb...@datameer.com>.
Quoting Lars Francke (2019-06-21 17:12:10)
> Hi Markus,

Hi Lars,

> ...nothing is stopping you from having your
> data owned by someone other than Hive as long as Hive can access the data.
> 
> This happens quite frequently when other tools ingest data into a directory
> that hive uses (whether this would be better as an external table is another
> discussion).

OK, sounds reasonable. The official HDP documentation describes it
more strictly, but...
The question is whether enterprise users will deviate from the officially
documented configuration/setup.

> In Ranger you specify ACLs for users but - in this case - you'd specify them
> for Hive objects...
> Sentry works slightly differently: Here you grant access to Hive objects as well
> but Sentry can automatically also grant HDFS access.

Ah ok. Didn't know that difference.

> You are correct though that an external user shouldn't meddle with "internals"
> but I see no harm in getting read-only access.

Yes, I also think read access is OK.

> Hope that helps.

Yes, sure.
Thanks for the quick answer.

> Cheers,
> Lars

Cheers
Marko


Re: Hive3 Managed Tables

Posted by Lars Francke <la...@gmail.com>.
Hi Markus,

You're on the right track, but not quite: a few things are mixed up.

1) Managed tables are purely a Hive feature; they have nothing to do with the
underlying storage. Hive _assumes_ that it has full (and sole) control over
the data when it's a managed table. But nothing is stopping you from having
your data owned by someone other than Hive as long as Hive can access the
data.

This happens quite frequently when other tools ingest data into a directory
that hive uses (whether this would be better as an external table is
another discussion).

2) When you want to access data through Hive, it can consult an
authorization plugin (technically not 100% correct but good enough for now)
and ask "Hey, is this user allowed to do that action?". This plugin can then
ask Sentry or Ranger or something else for a decision.

In Ranger you specify ACLs for users but - in this case - you'd specify
them for Hive objects, e.g. user Marko is allowed to look at the
"customers" table. That does not give you _any_ automatic permissions on
HDFS, so you're safe there.

Sentry works slightly differently: here you grant access to Hive objects as
well, but Sentry can also automatically grant HDFS access. If you have
permission to SELECT * from a table, then you can also read the data straight
from HDFS.

So, the scenario you outlined shouldn't happen when you use Ranger but can
happen when you use Sentry. These two projects might merge in the future
now that Cloudera and Hortonworks have merged. We'll see.
You are correct though that an external user shouldn't meddle with
"internals" but I see no harm in getting read-only access.
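
The difference between the two behaviours described above can be sketched
as a toy model. This is not the Ranger or Sentry API: the class
`ToyAuthorizer`, its methods, and the HDFS path are all hypothetical and
only mirror the distinction between "grant on a Hive object" and "grant
that is also propagated to HDFS".

```python
from dataclasses import dataclass, field


@dataclass
class ToyAuthorizer:
    # True mimics the behaviour where table grants also open the HDFS path.
    sync_hdfs: bool
    table_grants: set = field(default_factory=set)   # (user, table, action)
    table_paths: dict = field(default_factory=dict)  # table -> HDFS path

    def grant_select(self, user, table):
        self.table_grants.add((user, table, "SELECT"))

    def can_query(self, user, table):
        """Access through Hive: decided on the Hive object."""
        return (user, table, "SELECT") in self.table_grants

    def can_read_path(self, user, path):
        """Direct HDFS access: only follows table grants if sync is on."""
        if not self.sync_hdfs:
            return False
        return any(
            u == user and a == "SELECT" and self.table_paths.get(t) == path
            for (u, t, a) in self.table_grants
        )


path = "/hypothetical/warehouse/customers"  # assumed path, for illustration
for sync in (False, True):
    auth = ToyAuthorizer(sync_hdfs=sync, table_paths={"customers": path})
    auth.grant_select("marko", "customers")
    print(sync, auth.can_query("marko", "customers"),
          auth.can_read_path("marko", path))
```

In both cases the query through Hive is allowed; only the variant with
`sync_hdfs=True` also lets the user read the files directly.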

Hope that helps.

Cheers,
Lars
