Posted to issues@hive.apache.org by "Janos Kovacs (Jira)" <ji...@apache.org> on 2023/05/05 07:16:00 UTC

[jira] [Created] (HIVE-27322) Iceberg: metadata location overrides can cause data breach

Janos Kovacs created HIVE-27322:
-----------------------------------

             Summary: Iceberg: metadata location overrides can cause data breach
                 Key: HIVE-27322
                 URL: https://issues.apache.org/jira/browse/HIVE-27322
             Project: Hive
          Issue Type: Bug
          Components: Iceberg integration
    Affects Versions: 4.0.0-alpha-2
            Reporter: Janos Kovacs


Set to bug/blocker instead of enhancement because of its security-related nature; Hive 4 should not be released without a fix for this. Please reset if needed.

 

Context: 
 * There are some core tables with sensitive data that users can only query with data masking enforced (e.g. via Ranger). Let's assume this is the `default.icebergsecured` table.
 * An end-user can only access the masked form of the sensitive data as expected...
 * The users also have privilege to create new tables in their own sandbox databases - let's assume this is the `default.trojanhorse` table for now.
 * The user can create a malicious table that exposes the sensitive data non-masked leading to a possible data breach.

Repro:
 * First make sure the data is secured by the masking policy:
{noformat}
<kinit as privileged user>
beeline -e "
DROP TABLE IF EXISTS default.icebergsecured PURGE;
CREATE EXTERNAL TABLE default.icebergsecured (txt string, secret string) STORED BY ICEBERG;
INSERT INTO default.icebergsecured VALUES ('You might be allowed to see this.','You are NOT allowed to see this!');
"

<kinit as end user>
beeline -e "
SELECT * FROM default.icebergsecured;
"

+------------------------------------+--------------------------------+
|         icebergsecured.txt         |     icebergsecured.secret      |
+------------------------------------+--------------------------------+
| You might be allowed to see this.  | MASKED BY RANGER FOR SECURITY  |
+------------------------------------+--------------------------------+
{noformat}

 * Now let the user create the malicious table that exposes the sensitive data:
{noformat}
<kinit as end user>
SECURED_META_LOCATION=$(HADOOP_CLIENT_OPTS="-Djline.terminal=jline.UnsupportedTerminal" beeline -e "DESCRIBE FORMATTED default.icebergsecured;" 2>/dev/null |grep metadata_location  |grep -v previous_metadata_location | awk '{print $5}')
beeline -e "
DROP TABLE IF EXISTS default.trojanhorse;
CREATE EXTERNAL TABLE default.trojanhorse (txt string, secret string) STORED BY ICEBERG
TBLPROPERTIES (
  'metadata_location'='${SECURED_META_LOCATION}');
SELECT * FROM default.trojanhorse;
"

+------------------------------------+-----------------------------------+
|          trojanhorse.txt           |        trojanhorse.secret         |
+------------------------------------+-----------------------------------+
| You might be allowed to see this.  | You are NOT allowed to see this!  |
+------------------------------------+-----------------------------------+
{noformat}

 

Currently (after HIVE-26707), the rwstorage authorization request carries either the dummy path or the explicit path in its URI:
{noformat}
Permission denied: user [oozie] does not have [RWSTORAGE] privilege on 
[iceberg://default/trojanhorse?snapshot=%2Fwarehouse%2Ftablespace%2Fexternal%2Fhive%2Ftrojanhorse%2Fmetadata%2Fdummy.metadata.json]

Permission denied: user [oozie] does not have [RWSTORAGE] privilege on 
[iceberg://default/trojanhorse?snapshot=%2Fwarehouse%2Ftablespace%2Fexternal%2Fhive%2Ficebergsecured%2Fmetadata%2F00001-f4c2a428-30ce-4afd-82ff-d46ecbf02244.metadata.json] 
{noformat}
This can only be used to decide whether a user is allowed to create Iceberg tables in certain databases with certain names; controlling the table's metadata location is hard in that form:
 * it does not provide a "default table location" variable, so a rule needs to know the per-database table location or the per-catalog warehouse location to be able to construct the path
 * it does not provide a regex rich enough to filter out `/../` style directory references
 * more fundamentally, there should be a flag indicating whether an explicit metadata location was provided, instead of the dummy reference, which in turn needs explicit matching in the policy
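To see what a policy author has to work with today, the following minimal sketch decodes the second URI from the error message above using only the Python standard library. The helper name `describe_rwstorage_uri` is hypothetical; it is not part of Hive or Ranger, and this is not how the authorizer itself processes the URI.

```python
from urllib.parse import urlsplit, parse_qs


def describe_rwstorage_uri(uri: str):
    """Split an iceberg:// authorization URI into (database, table, path).

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(uri)
    database = parts.netloc
    table = parts.path.lstrip("/")
    # parse_qs percent-decodes the value of the "snapshot" parameter.
    path = parse_qs(parts.query)["snapshot"][0]
    return database, table, path


# URI copied from the second "Permission denied" message in this report.
uri = (
    "iceberg://default/trojanhorse?snapshot="
    "%2Fwarehouse%2Ftablespace%2Fexternal%2Fhive%2Ficebergsecured"
    "%2Fmetadata%2F00001-f4c2a428-30ce-4afd-82ff-d46ecbf02244.metadata.json"
)

database, table, path = describe_rwstorage_uri(uri)
print(database, table)
print(path)
```

Once decoded, it is visible that the request is for table `trojanhorse` while the embedded path points into the `icebergsecured` metadata directory, which is the mismatch a policy would need to catch.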

 

Proposed enhancement:
 * The URL for the Iceberg table's rwstorage authorization should be changed in the following way:
 ** the <database>/<table>?<location> structure is good, but
 *** the location should not be URL-encoded, or at least the authorizer should check the policy against the decoded URL
 *** the separator between the table and the location should be "/" instead of "?", as "?" can be confused with its regex meaning
 *** "/" as the separator can also be confusing, as absolute paths start with it. Another separator character that does not conflict with regex syntax, paths, or valid table-name characters might be even better.
 ** the "snapshot=" part seems irrelevant in this context; it should not be part of the <location>
 ** There is a need to differentiate between a location that is only generated and one that is explicitly provided by the end user. For this, the "default" location could be replaced with the fixed value "default_location" rather than generated as a path. Note that it has no leading "/". That way a single policy definition could cover all tables in their default locations, like:
{noformat}
iceberg://mydatabase/*/default_location           or
iceberg://mydatabase/*/snapshot=default_location{noformat}

 *** "default" here means the table's default location, which can depend on the warehouse location, the database location and the table's explicit location
 *** I know many developers don't like this type of hardcoded static value, but with such a value there is no need to modify the rwstorage authorization (and we are already using a similar method for the "METASTORE" type of storage handler authorization)
 * The authorization request should include the rwstorage authorization only if the CREATE/ALTER/DROP targets the Iceberg table, not if the Iceberg table is only a source in such statements. This might already be fixed via HIVE-27304.
 * When a custom Iceberg metadata.json location is provided, the <location> must contain the provided path so that it can be properly authorized.
 * We should either not allow the backstep "/../" in any location, or provide an option to filter it out in the final authorization step.
For example, a policy on
{noformat}
iceberg://mydatabase/mytable//data/use-case-1/*             or
iceberg://mydatabase/mytable/snapshot=/data/use-case-1/*{noformat}
should not match an access request for
{noformat}
iceberg://mydatabase/mytable//data/use-case-1/../use-case-2/*          or
iceberg://mydatabase/mytable/snapshot=/data/use-case-1/../use-case-2/*{noformat}

 ** Not allowing "/../" might be easier to handle on the Hive/Impala side
 ** "/../" might still be valid if it is used within an allowed location, but users should simply use a proper location without "/../"
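The back-step problem above can be illustrated with a minimal sketch. It assumes a simple glob-style wildcard match (Python's `fnmatchcase`, where "*" also matches "/") as a stand-in for the policy engine; Ranger's actual matcher may behave differently. The function names and the `t1/metadata/v1.metadata.json` suffix are hypothetical.

```python
from fnmatch import fnmatchcase


def naive_match(location: str, pattern: str) -> bool:
    """Plain wildcard match; '*' also matches '/' and '..' segments."""
    return fnmatchcase(location, pattern)


def has_backstep(location: str) -> bool:
    """True if any path segment is exactly '..'."""
    return ".." in location.split("/")


def safe_match(location: str, pattern: str) -> bool:
    """Reject back-steps first, then apply the wildcard match."""
    return not has_backstep(location) and fnmatchcase(location, pattern)


policy = "/data/use-case-1/*"
# Hypothetical metadata paths, for illustration only.
inside = "/data/use-case-1/t1/metadata/v1.metadata.json"
escape = "/data/use-case-1/../use-case-2/t1/metadata/v1.metadata.json"

print(naive_match(escape, policy))  # True: the back-step slips through
print(safe_match(escape, policy))   # False: rejected before matching
print(safe_match(inside, policy))   # True: a regular location still matches
```

Rejecting any ".." segment outright corresponds to the first option in the list above; it is the simpler of the two, at the cost of refusing some locations that would resolve to an allowed path.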

 

With the above changes, a single new default Ranger policy could globally enable the creation of Iceberg tables in any database and any table location, but NOT with a custom metadata location. This helps when the user does not want to turn storage handler authorization off globally (via hive.security.authorization.tables.on.storagehandlers=false) because of other handlers such as HBase or Kafka:
{noformat}
iceberg://*/*/default_location              or
iceberg://*/*/snapshot=default_location  {noformat}
Note that there is no extra "/" in front of "default_location"; in the first case there is only the separator character.

 

Also, with the above changes, we could allow users to configure authorization for custom/shared data locations via:
{noformat}
iceberg://mydatabase/*//some/shared/table/location/*              or
iceberg://mydatabase/*/snapshot=/some/shared/table/location/* {noformat}
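As a rough sketch of how the proposed resource format and the two policies above could interact, the following uses the "snapshot=" variant. Everything here is an assumption made for illustration: the helper names are hypothetical, the matching is done per component with Python's `fnmatchcase` rather than by Ranger, and the example table names and metadata paths are invented.

```python
from fnmatch import fnmatchcase
from typing import Optional

PREFIX = "iceberg://"
DEFAULT_LOCATION = "default_location"  # fixed placeholder proposed in this issue


def build_resource(database: str, table: str,
                   metadata_location: Optional[str] = None) -> str:
    """Build the proposed resource string (hypothetical helper)."""
    location = DEFAULT_LOCATION if metadata_location is None else metadata_location
    return f"{PREFIX}{database}/{table}/snapshot={location}"


def split_resource(resource: str):
    """Split into (database, table, location); location keeps its slashes."""
    database, table, location = resource[len(PREFIX):].split("/", 2)
    return database, table, location


def policy_allows(policy: str, resource: str) -> bool:
    """Match database, table and location separately against the policy."""
    return all(
        fnmatchcase(value, pattern)
        for pattern, value in zip(split_resource(policy), split_resource(resource))
    )


default_policy = "iceberg://*/*/snapshot=default_location"
shared_policy = "iceberg://mydatabase/*/snapshot=/some/shared/table/location/*"

plain = build_resource("default", "trojanhorse")
custom = build_resource("default", "trojanhorse",
                        "/warehouse/icebergsecured/metadata/v1.metadata.json")
shared = build_resource("mydatabase", "t1",
                        "/some/shared/table/location/t1/metadata/v1.metadata.json")

print(policy_allows(default_policy, plain))   # True
print(policy_allows(default_policy, custom))  # False
print(policy_allows(shared_policy, shared))   # True
```

Matching the three components separately is a deliberate choice in this sketch: a wildcard in the database or table position then cannot absorb part of an explicitly provided path.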
  


