Posted to dev@hive.apache.org by Elliot West <te...@gmail.com> on 2018/04/20 13:30:28 UTC

Proposal: Apply SQL based authorization functions in the metastore.

Hello,

I’d like to propose that SQL based authorization (or something similar) also
be applied and enforced in the metastore service as part of the
initiative to extract HMS as an independent project. While any such
implementation cannot be ‘system complete’ like HiveServer2 (HS2) (HMS has
no scope to intercept operations applied to table data, only metadata), it
would be a significant step forward for controlling the operations that can
be performed by the many non-HS2 clients in the Hive ecosystem.

I believe this is a good time to consider this option, as there is currently
much discussion in the Hive community on the future direction of HMS and
growing recognition that HMS is now general data platform
infrastructure and not simply an internal Hive component.

Further details are below. I’d be grateful for any feedback, thoughts, and
suggestions on how this could move forward.

*Problem*
At this time, Hive’s SQL based authorization feature is the recommended
approach for controlling which operations may be performed, on what, and by
whom. This feature is enforced in the HS2 component. However, a large number
of platforms that integrate with Hive do not do so via HS2; they instead talk
to the metastore service directly and so bypass authorization. They can
perform destructive operations such as a table drop even though the
permissions declared in the metastore may explicitly forbid it, because they
circumvent the authorization logic in HS2 entirely.

In short, there is a lack of encapsulation of authorization in
the metastore: HMS owns the metadata, is responsible for performing actions
on that metadata, and maintains the permissions describing which actions are
permissible by whom, yet it has no means to use this information to protect
the data it owns.
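
To make the exposure concrete, here is a minimal sketch (host, database, and
table names are placeholders) of a client talking Thrift directly to the
metastore; nothing in this path consults the SQL based authorization policy
held in HS2:

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;

public class DirectMetastoreDrop {
  public static void main(String[] args) throws Exception {
    HiveConf conf = new HiveConf();
    // Point straight at the metastore Thrift endpoint; HS2 is never involved.
    conf.setVar(HiveConf.ConfVars.METASTOREURIS, "thrift://metastore-host:9083");

    HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
    try {
      // Succeeds regardless of any SQL based authorization policy configured
      // in HS2, because that policy is only evaluated by the HS2 compiler.
      client.dropTable("my_db", "my_table");
    } finally {
      client.close();
    }
  }
}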

*Workarounds*
Common workarounds to this deficiency include falling back to storage based
authorization or running read only metastore instances. However, both of
these approaches have significant drawbacks:

   - Storage based auth does not function when using object stores such as S3
   and so is not usable in cloud deployments of Hive - a pattern that is
   seeing significant growth.
   - Read only metastores incur significant infrastructure and operational
   overheads, requiring a separate set of server instances, while delivering
   little functionality and only blunt authorization capabilities. You cannot,
   for example, restrict a particular operation type, by a certain user, on a
   specific table; you can only block all writes by directing different user
   groups to different network endpoints.

*Anti-patterns*
It might be tempting to simply suggest using HS2 for all access to Hive
data. However, while this is conceptually appealing, it’s not practical to
apply on large, rich, and diverse data platforms where tool
interoperability and broad compatibility are required. Additionally, it can
be argued that the API exposed by HS2, while useful for analytical tools,
is not fit for use by large ETL processes; for example, using a “SELECT *”
over JDBC as the source for a large Spark job doesn’t scale.
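
As a sketch of the access pattern being referred to (host, database, table,
and output path are placeholders), a Spark job sourcing a table through the
HS2 JDBC endpoint pulls the whole result set through a single connection
unless explicit partitioning options are supplied:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class Hs2JdbcSource {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hs2-jdbc-source")
        .getOrCreate();

    // Effectively a "SELECT *" funnelled through HiveServer2 over JDBC.
    // Without partitionColumn/lowerBound/upperBound/numPartitions this is
    // a single-partition read, so the transfer does not parallelise.
    Dataset<Row> source = spark.read()
        .format("jdbc")
        .option("url", "jdbc:hive2://hs2-host:10000/my_db")
        .option("driver", "org.apache.hive.jdbc.HiveDriver")
        .option("dbtable", "my_large_table")
        .load();

    source.write().parquet("/tmp/staged_copy");
  }
}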

*High level implementation notes*
I believe that HMS requires little (if any) refactoring to support the
implementation of SQL based auth in the metastore. It already maintains
all of the necessary metadata describing the authorization rules that
should be applied. It also has access, via the UGI mechanism, to the
principal wishing to perform a given action. Finally, there is an existing
hook mechanism for intercepting metadata operations and applying authorization.
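
For illustration, a rough sketch (not a working patch; the listener class
name and the privilege lookup are hypothetical) of how the existing pre-event
hook, registered via hive.metastore.pre.event.listeners, could consult the
grant metadata that HMS already stores before permitting a DROP TABLE:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.metastore.MetaStorePreEventListener;
import org.apache.hadoop.hive.metastore.api.InvalidOperationException;
import org.apache.hadoop.hive.metastore.api.MetaException;
import org.apache.hadoop.hive.metastore.api.NoSuchObjectException;
import org.apache.hadoop.hive.metastore.api.Table;
import org.apache.hadoop.hive.metastore.events.PreDropTableEvent;
import org.apache.hadoop.hive.metastore.events.PreEventContext;
import org.apache.hadoop.security.UserGroupInformation;

public class SqlAuthPreEventListener extends MetaStorePreEventListener {

  public SqlAuthPreEventListener(Configuration config) {
    super(config);
  }

  @Override
  public void onEvent(PreEventContext context)
      throws MetaException, NoSuchObjectException, InvalidOperationException {
    // The calling principal, obtained via the UGI mechanism mentioned above.
    String user;
    try {
      user = UserGroupInformation.getCurrentUser().getShortUserName();
    } catch (java.io.IOException e) {
      throw new MetaException("Unable to determine calling principal: " + e.getMessage());
    }

    switch (context.getEventType()) {
      case DROP_TABLE:
        Table table = ((PreDropTableEvent) context).getTable();
        if (!hasPrivilege(user, "DROP", table.getDbName(), table.getTableName())) {
          throw new MetaException(user + " lacks DROP privilege on "
              + table.getDbName() + "." + table.getTableName());
        }
        break;
      default:
        // CREATE_TABLE, ALTER_TABLE, etc. would be handled similarly.
        break;
    }
  }

  // Placeholder: a real implementation would read the grant/role metadata
  // that HMS already persists for SQL based authorization.
  private static boolean hasPrivilege(String user, String op, String db, String table) {
    return false;
  }
}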

In deployments that use HS2 exclusively, the proposed metastore resident
SQL based auth could either be disabled or used harmlessly in conjunction
with the HS2 implementation.

Thanks,

Elliot.

Elliot West
Senior Engineer
Data Platform Team
Hotels.com

Re: Proposal: Apply SQL based authorization functions in the metastore.

Posted by Thejas Nair <th...@gmail.com>.
Hi Elliot,

One scenario where storage based authorization doesn't work is the
case of object stores such as S3. In those scenarios, the
tool/platform that is accessing the data won't have any restrictions
on data access either. I am not sure how the data access would be
secured in such cases, even if metastore access is controlled.

Overall, the metastore API is a much lower-level API, and as a result
it is difficult to enforce higher-level restrictions at that level
(more on that below).

I agree that ODBC/JDBC via HS2 is not something distributed tools can use
(at least with the standard API).
I think the ideal way to enforce security is to have tools/platforms
read via a 'table server' (and not give them direct file system
access).
At Hortonworks, we have been using this to provide security for Spark,
by allowing it to read in parallel from LLAP daemons -
https://www.slideshare.net/Hadoop_Summit/security-updates-more-seamless-access-controls-with-apache-spark-and-apache-ranger
https://github.com/hortonworks-spark/spark-llap/wiki/1.-Goal-and-features
(You can replace Ranger with SQL auth in the above examples as well.)

The next phase of that work would likely make use of Apache Arrow for the
data exchange (there are some Hive JIRAs created recently around it).

I had considered having the authorization at the metastore level, but
realized that it is not the right place to enforce RDBMS/SQL style
policies. Here are some notes I wrote a while back about it -
http://hadoop-pig-hive-thejas.blogspot.com/2014/03/hive-sql-standard-authorization-why-not.html

Quoting from there -
The advantage of doing it at the metastore API level would have been
that Pig and MR would also be covered under this authorization model.
But this works only if the SQL actions always need some metastore API
calls, and access control on the calls they need to make can be used
to enforce the SQL level authorization.

Take, for example, the INSERT privilege in SQL: you can grant INSERT
without granting the SELECT privilege. But when processing INSERT queries
for the user, we need to be able to do a getTable() and read the schema of
the table. Yet from the metastore API perspective, you should not be able
to do a getTable() without having SELECT privileges on the table.
Similar issues arise with the DELETE and UPDATE privileges, which you can
grant without SELECT.
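
As a purely hypothetical illustration of that mismatch (the class and the
call-to-privilege mapping below are invented for the example, not Hive
code): if a metastore-level authorizer naively maps get_table() to the
SELECT privilege, a user who legitimately holds only INSERT is blocked from
a query they are entitled to run:

import java.util.Set;

// Hypothetical only: a naive mapping from metastore API calls to required
// SQL privileges, showing why it breaks INSERT-without-SELECT grants.
public class NaiveMetastoreAuth {

  static String requiredPrivilege(String metastoreCall) {
    switch (metastoreCall) {
      case "get_table":  return "SELECT"; // reading the schema treated as a read
      case "drop_table": return "DROP";
      default:           return "ADMIN";
    }
  }

  public static void main(String[] args) {
    // A user granted INSERT but not SELECT on a table (legal in SQL auth).
    Set<String> grants = Set.of("INSERT");

    // Planning "INSERT INTO ..." requires the table schema, which reaches
    // the metastore as a get_table() call.
    String needed = requiredPrivilege("get_table");

    // The call is rejected even though the user's INSERT is legitimate.
    System.out.println("get_table allowed? " + grants.contains(needed)); // prints false
  }
}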

Another example is URIs in the SQL statement: you don't need to make
any metastore API calls before accessing URIs, so URI access control
can't be implemented using metastore API calls.

Another use case is anything that you want to allow the ADMIN to do
where the action does not involve specific metastore API calls that
could be used to control it.

Thanks,
Thejas


On Fri, Apr 20, 2018 at 6:30 AM, Elliot West <te...@gmail.com> wrote:
> Hello,
>
> I’d like to propose that SQL based authorization (or something similar) be
> applied and enforced also in the metastore service as part of the initiative
> to extract HMS as an independent project. While any such implementation
> cannot be ’system complete’ like HiveServer2 (HS2) (HMS has no scope to
> intercept operations applied to table data, only metadata), it would be a
> significant step forward for controlling the operations that can be actioned
> by the many non-HS2 clients in the Hive ecosystem.
>
> I believe this is a good time to consider this option as there is currently
> much discussion in the Hive community on the future directions of HMS and
> greater recognition that HMS is now seen as general data platform
> infrastructure and not simply an internal Hive component.
>
> Further details are below. I’d be grateful for any feedback, thoughts, and
> suggestions on how this could move forward.
>
> Problem
> At this time, Hive’s SQL based authorization feature is the recommended
> approach for controlling which operations may be performed on what by whom.
> This feature is applied in the HS2 component. However, a large number of
> platforms that integrate with Hive do not do so via HS2, instead talking to
> the metastore service directly and so bypassing authorization. They can
> perform destructive operations such as a table drop even though the
> permissions declared in the metastore may explicitly forbid it as they are
> able to circumvent the authorization logic in HS2.
>
> In short, there seems to be a lack of encapsulation with authorization in
> the metastore; HMS owns the metadata, is responsible for performing actions
> on metadata, for maintaining permissions on what actions are permissible by
> whom, and yet has no means to use the information it has to protect the data
> it owns.
>
> Workarounds
> Common workarounds to this deficiency include falling back to storage based
> authorization or running read only metastore instances. However, both of
> these approaches have significant drawbacks:
>
> File based auth does not function when using object stores such as S3 and so
> is not usable in cloud deployments of Hive - a pattern that is seeing
> significant growth.
> Read only metastores incur significant infrastructure and operational
> overheads, requiring a separate set of server instances, while delivering
> little functionality and blunt authorization capabilities. You cannot for
> example restrict a particular operation type, by a certain user, on a
> specific table. You are literally blocking all writes by directing different
> user groups to different network endpoints.
>
> Anti-patterns
> It might be tempting to simply suggest using HS2 for all access to Hive
> data. However, while this is conceptually appealing, it’s not practical to
> apply on large, rich, and diverse data platforms where tool interoperability
> and broad compatibility is required. Additionally, it can be argued that the
> API exposed by HS2, while useful for analytical tools, is not fit for use by
> large ETL processes; for example: using a “SELECT *” over JDBC as a source
> for a large Spark job doesn’t scale.
>
> High level implementation notes
> I believe that the HMS requires little (if any) refactoring to support the
> implementation of SQL based auth in the metastore. It currently maintains
> all of the necessary metadata that describes the authorization rules that
> should be applied. It also has access to the principle wishing to perform a
> certain action via the UGI mechanism. Finally, there is an existing hook
> mechanism to intercept metadata operations and apply authorization.
>
> In deployments that use HS2 exclusively, the proposed metastore resident SQL
> based auth could either be disabled or used harmlessly in conjunction with
> the HS2 implementation.
>
> Thanks,
>
> Elliot.
>
> Elliot West
> Senior Engineer
> Data Platform Team
> Hotels.com
