You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Janos Kovacs (Jira)" <ji...@apache.org> on 2023/09/20 10:40:00 UTC

[jira] [Commented] (HIVE-27713) Iceberg: metadata location overrides can cause data breach

    [ https://issues.apache.org/jira/browse/HIVE-27713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767079#comment-17767079 ] 

Janos Kovacs commented on HIVE-27713:
-------------------------------------

Proposal from the original ticket before split into subtasks:

 


Proposed enhancement:
 * The URL for the iceberg table's rwstorage authorization should be changed the following way
 ** the <database>/<table>?<location> is good but
 *** the location should not be url encoded, or at least the authorizer should check the policy against the decoded url
 *** the separator between the table and location should be "/" instead of "?" as "?" might be mixed with its regex meaning!
 *** "/" as separator can be also confusing as the absolute paths would start with it. Might be that another separator character that does not conflict with regex, paths and table-name valid characters would be even better.
 ** the "snapshot=" seems to be non-relevant in this context, it should not be part of the <location>
 ** There is a need to differentiate the cases where location is only generated or when location is explicitly provided by end user. For this, the "default" location might just not be generated as path but replaced with "default_location" fixed value - note! it has no leading "/". That way a single policy definition could be used to cover all tables in their default locations like:
{noformat}
iceberg://mydatabase/*/default_location           or
iceberg://mydatabase/*/snapshot=default_location{noformat}

 * 
 ** 
 *** "default" here means the table's default location which can depend on the warehouse location, the database location and the table's explicit location
 *** I know many developers don't like these type of hardcoded static values but with such a value there is no need to modify the rwstorage authorization (and we already using similar method for "METASTOER" type of storagehandler authorization)
 * The authorization request should include the rwstorage authorization only if the CERATE/ALTER/DROP is against the Iceberg table and not if in such statements the iceberg table is only a source - that might be already in fix via HIVE-27304
 * When any custom Iceberg metadata.json location is provided then the <location> must contain to provided path to be able to properly authorize it.
 * We either should not allow backstep "/../" in any locations or give an option to filter these out in the final authorization step.
Like with a policy on
{noformat}
iceberg://mydatabase/mytable//data/use-case-1/*             or
iceberg://mydatabase/mytable/snapshot=/data/use-case-1/*{noformat}
should not match access for
{noformat}
iceberg://mydatabse/mytable//data/use-case-1/../use-case-2/*          or
iceberg://mydatabse/mytable/snapshot=/data/use-case-1/../use-case-2/*{noformat}

 * 
 ** Not allowing "/../" might be easier to handle on hive/impala side
 ** But "/../" might be still valid if it used within an allowed locations. --> users just should use proper location w/o "/../"

 

With the above changes one single new default Ranger policy could be used to globally enable the creation of iceberg tables in any databases in any table locations but NOT using any custom metadata locations (in case if user doesn't want to globally turn storagehandler authorization off - via hive.security.authorization.tables.on.storagehandlers=false - because of other - hbase, kafka - handlers):
{noformat}
iceberg://*/*/default_location              or
iceberg://*/*/snapshot=default_location  {noformat}
Note that there is no extra "/" in front of "default_location", only the separator character in the first case

 

Also with the above changes we could allow users to configure authorizations for custom/shared data location via:
{noformat}
iceberg://mydatabase/*//some/shared/table/location/*              or
iceberg://mydatabase/*/snapshot=/some/shared/table/location/* {noformat}
 

> Iceberg: metadata location overrides can cause data breach
> ----------------------------------------------------------
>
>                 Key: HIVE-27713
>                 URL: https://issues.apache.org/jira/browse/HIVE-27713
>             Project: Hive
>          Issue Type: Bug
>          Components: Authorization, Iceberg integration
>    Affects Versions: 4.0.0-alpha-2
>            Reporter: Janos Kovacs
>            Assignee: Ayush Saxena
>            Priority: Major
>              Labels: check
>
> Set to bug/blocker instead of enhancement due to its security related nature, Hive4 should not be released w/o fix for this. Please reset if needed.
>  
> Context: 
>  * There are some core tables with sensitive data that users can only query with data masking enforced (e.g. via Ranger). Let's assume this is the `default.icebergsecured` table.
>  * An end-user can only access the masked form of the sensitive data as expected...
>  * The users also have privilege to create new tables in their own sandbox databases - let's assume this is the `default.trojanhorse` table for now.
>  * The user can create a malicious table that exposes the sensitive data non-masked leading to a possible data breach.
>  * Hive runs with doAs=false to be able to enforce FGAC and prevent end-user direct file-system access needs
> Repro:
>  * First make sure the data is secured by the masking policy:
> {noformat}
> <kinit as privileged user>
> beeline -e "
> DROP TABLE IF EXISTS default.icebergsecured PURGE;
> CREATE EXTERNAL TABLE default.icebergsecured (txt string, secret string) STORED BY ICEBERG;
> INSERT INTO default.icebergsecured VALUES ('You might be allowed to see this.','You are NOT allowed to see this!');
> "
> <kinit as end user>
> beeline -e "
> SELECT * FROM default.icebergsecured;
> "
> +------------------------------------+--------------------------------+
> |         icebergsecured.txt         |     icebergsecured.secret      |
> +------------------------------------+--------------------------------+
> | You might be allowed to see this.  | MASKED BY RANGER FOR SECURITY  |
> +------------------------------------+--------------------------------+
> {noformat}
>  * Now let the user to create the malicious table exposing the sensitive data:
> {noformat}
> <kinit as end user>
> SECURED_META_LOCATION=$(HADOOP_CLIENT_OPTS="-Djline.terminal=jline.UnsupportedTerminal" beeline -e "DESCRIBE FORMATTED default.icebergsecured;" 2>/dev/null |grep metadata_location  |grep -v previous_metadata_location | awk '{print $5}')
> beeline -e "
> DROP TABLE IF EXISTS default.trojanhorse;
> CREATE EXTERNAL TABLE default.trojanhorse (txt string, secret string) STORED BY ICEBERG
> TBLPROPERTIES (
>   'metadata_location'='${SECURED_META_LOCATION}');
> SELECT * FROM default.trojanhorse;
> "
> +------------------------------------+-----------------------------------+
> |          trojanhorse.txt           |        trojanhorse.secret         |
> +------------------------------------+-----------------------------------+
> | You might be allowed to see this.  | You are not allowed to see this!  |
> +------------------------------------+-----------------------------------+
> {noformat}
>  
> Currently - after HIVE-26707 - the rwstorage authorization only has either the dummy path or the explicit path set for uri:  
> {noformat}
> Permission denied: user [oozie] does not have [RWSTORAGE] privilege on 
> [iceberg://default/trojanhorse?snapshot=%2Fwarehouse%2Ftablespace%2Fexternal%2Fhive%2Ftrojanhorse%2Fmetadata%2Fdummy.metadata.json]
> Permission denied: user [oozie] does not have [RWSTORAGE] privilege on 
> [iceberg://default/trojanhorse?snapshot=%2Fwarehouse%2Ftablespace%2Fexternal%2Fhive%2Ficebergsecured%2Fmetadata%2F00001-f4c2a428-30ce-4afd-82ff-d46ecbf02244.metadata.json] 
> {noformat}
> With custom location it's even not passed to the authorizer:
> {noformat}
> 2023-05-17 19:38:51,867 INFO  org.apache.hadoop.hive.ql.Driver: [a49356b4-1b7a-4c9d-9b70-81af12c0465f HiveServer2-Handler-Pool: Thread-253]: Compiling command(queryId=hive_20230517193851_8b9f0ad7-2ae1-4078-b76a-e51c31321b0b): 
> CREATE EXTERNAL TABLE default.policytestth (txt string, secret string) STORED BY ICEBERG 
> TBLPROPERTIES (
>   'metadata_location'='hdfs://test.local.host:8020/warehouse/tablespace/external/hive/policytest/metadata/00001-a3e46c1b-318b-4b46-886a-c6ea591f63c1.metadata.json')
> ...
> 2023-05-17 19:38:51,898 DEBUG org.apache.iceberg.mr.hive.HiveIcebergStorageHandler: [a49356b4-1b7a-4c9d-9b70-81af12c0465f HiveServer2-Handler-Pool: Thread-253]: 
> Iceberg storage handler authorization URI 
> iceberg://default/policytestth?snapshot=%2Fwarehouse%2Ftablespace%2Fexternal%2Fhive%2Fpolicytestth%2Fmetadata%2Fdummy.metadata.json
> {noformat}
>  
> Mandatory changes required for securing tables:
>  * Custom location needs to be passed to the Authorizer
> Changes required for usability - e.g. to eliminate the need to require a policy for each tables:
>  * Default location needs to be calculated based on warehouse/database def. location
>  * CREATE/ALTER with default locations should not involve RWStorage Authorization or should be handled a special way in the Authorizer. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)