Posted to notifications@kyuubi.apache.org by "huangzhir (via GitHub)" <gi...@apache.org> on 2023/04/01 06:43:50 UTC

[GitHub] [kyuubi] huangzhir commented on pull request #4643: [KYUUBI #4530] [AUTHZ] fixbug support MASK_SHOW_FIRST_4 and MASK_SHOW_FIRST_4 chinese data mask

huangzhir commented on PR #4643:
URL: https://github.com/apache/kyuubi/pull/4643#issuecomment-1492853202

   Let me summarize how this issue came about and how Hive, Spark, and Trino handle it.
   
   Hive's data masking is implemented using the functions mask({col}), mask_show_last_n({col}, 4, 'x', 'x', 'x', -1, '1'), and mask_show_first_n({col}, 4, 'x', 'x', 'x', -1, '1') (see https://github.com/apache/ranger/blob/7f5b82bff2df72f20f5c41ba095406d354f8acf0/agents-common/src/main/resources/service-defs/ranger-servicedef-hive.json#L387). 
   ```json 
   {
       "itemId": 3,
       "name": "MASK_SHOW_FIRST_4",
       "label": "Partial mask: show first 4",
       "description": "Show first 4 characters; replace rest with 'x'",
       "transformer": "mask_show_first_n({col}, 4, 'x', 'x', 'x', -1, '1')"
   }
   ```
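   However, Hive's implementation only masks uppercase letters, lowercase letters, and digits. Every other character, including Chinese and other non-English text, falls through to the default branch, and because the transformer above passes -1 (the unmasked value) for the "other" character, those characters are returned unchanged (see https://github.com/apache/hive/blob/7b3ecf617a6d46f48a3b6f77e0339fd4ad95a420/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFMask.java#L262).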
   ```java 
         default:
           if(maskedOtherChar != UNMASKED_VAL) {
             return maskedOtherChar;
           }
           break;
       }
   ```
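   
   To make the effect concrete, here is a minimal, simplified sketch of that per-character logic. It is not Hive's actual code: the class name `MaskSketch` and the method `maskShowFirstN` are made up for illustration, the number-masking argument is left out, and the Chinese sample string is written with `\u` escapes only to keep the source ASCII. Passing 'x' for the "other" character in the last call is just one hypothetical way to get Chinese text masked, not necessarily what this PR does.
   
   ```java
   public class MaskSketch {
       // Sentinel meaning "leave this character class unmasked" (the -1 in the transformer).
       static final int UNMASKED_VAL = -1;

       // Simplified stand-in for the per-character switch quoted above.
       static int transformChar(int c, int upper, int lower, int digit, int other) {
           switch (Character.getType(c)) {
               case Character.UPPERCASE_LETTER:
                   return upper != UNMASKED_VAL ? upper : c;
               case Character.LOWERCASE_LETTER:
                   return lower != UNMASKED_VAL ? lower : c;
               case Character.DECIMAL_DIGIT_NUMBER:
                   return digit != UNMASKED_VAL ? digit : c;
               default:
                   // Chinese characters are OTHER_LETTER, so they land here.
                   return other != UNMASKED_VAL ? other : c;
           }
       }

       // Hypothetical helper: keep the first n code points, transform the rest.
       static String maskShowFirstN(String value, int n, int upper, int lower, int digit, int other) {
           StringBuilder out = new StringBuilder();
           int seen = 0;
           for (int i = 0; i < value.length(); ) {
               int cp = value.codePointAt(i);
               out.appendCodePoint(seen < n ? cp : transformChar(cp, upper, lower, digit, other));
               i += Character.charCount(cp);
               seen++;
           }
           return out.toString();
       }

       public static void main(String[] args) {
           String chinese = "\u4e2d\u534e\u4eba\u6c11\u5171\u548c\u56fd"; // seven Chinese characters
           // ASCII input is masked after the first 4 characters: abcdxxxxxx
           System.out.println(maskShowFirstN("abcdEFG123", 4, 'x', 'x', 'x', UNMASKED_VAL));
           // With other = -1 the Chinese string comes back unchanged.
           System.out.println(maskShowFirstN(chinese, 4, 'x', 'x', 'x', UNMASKED_VAL));
           // With other = 'x' the last three characters become "xxx".
           System.out.println(maskShowFirstN(chinese, 4, 'x', 'x', 'x', 'x'));
       }
   }
   ```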
   
   As for Spark, a JIRA ticket (https://issues.apache.org/jira/browse/SPARK-23901) added mask-related functions, but they were later judged not to be general-purpose enough and the code was reverted, so Spark currently has no implementation of these mask functions.
   
   Trino does not implement these mask-related functions either. Instead, Ranger's Trino service definition performs data masking with regexp_replace, which in Trino accepts a lambda expression as the replacement (see https://github.com/apache/ranger/blob/a0224b2fdef999b3e23e2374080df94bf38557a4/agents-common/src/main/resources/service-defs/ranger-servicedef-trino.json#L393).
   ```json
   {
       "itemId": 2,
       "name": "MASK_SHOW_LAST_4",
       "label": "Partial mask: show last 4",
       "description": "Show last 4 characters; replace rest with 'X'",
       "transformer": "cast(regexp_replace({col}, '(.*)(.{4}$)', x -> regexp_replace(x[1], '.', 'X') || x[2]) as {type})"
   }
   ```
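   
   For comparison, the same regex idea can be sketched in Java. This is only an illustration of the pattern from the transformer above, not Trino's or Ranger's code: the class name `RegexMaskSketch` and the method `maskShowLast4` are hypothetical, and it relies on `Matcher.replaceAll` with a function argument (Java 9+) as a stand-in for Trino's lambda. Because `.` matches any character regardless of script, Chinese text is masked the same way as ASCII.
   
   ```java
   import java.util.regex.Matcher;
   import java.util.regex.Pattern;

   public class RegexMaskSketch {
       // Same pattern as in the Trino transformer: group 1 is everything
       // before the last four characters, group 2 is the last four.
       private static final Pattern SHOW_LAST_4 = Pattern.compile("(.*)(.{4}$)");

       // Hypothetical helper: mask group 1 with 'X' and keep group 2 as is.
       static String maskShowLast4(String value) {
           return SHOW_LAST_4.matcher(value).replaceAll(r ->
               Matcher.quoteReplacement(r.group(1).replaceAll(".", "X") + r.group(2)));
       }

       public static void main(String[] args) {
           // Prints XXXXXX7890
           System.out.println(maskShowLast4("1234567890"));
           // Seven Chinese characters: the first three become "XXX", the last four stay.
           System.out.println(maskShowLast4("\u4e2d\u534e\u4eba\u6c11\u5171\u548c\u56fd"));
           // Fewer than four characters: the pattern does not match, so the input is returned unchanged.
           System.out.println(maskShowLast4("abc"));
       }
   }
   ```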

