You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@shardingsphere.apache.org by GitBox <gi...@apache.org> on 2022/09/30 13:46:09 UTC

[GitHub] [shardingsphere] cheese8 commented on issue #20435: How to support the fuzzy query of encrypted fields

cheese8 commented on issue #20435:
URL: https://github.com/apache/shardingsphere/issues/20435#issuecomment-1263597723

   > After analysis, we draw the following conclusions
   > 
   > **1. Serious encryption algorithms are not available on regular databases**
   > 
   > Serious encryption algorithms such as: Homomorphic Encryption, Searchable Encryption. Both require the database to have the ability to execute related algorithms, which the regular database does not support. You can refer to an experimental implementation based on MySQL https://css.csail.mit.edu/cryptdb/
   > 
   > **2. In order to use the LIKE function of regular database, character by character "encryption" must be used**
   > 
   > Reversible single-character (two-character, few-character) encryption-decryption solution can be cracked by rainbow tables (fixed input corresponds to fixed output, and fixed output corresponds to unique deterministic input). The entire ciphertext can be reversely deduced on the basis of knowing part of the original text and ciphertext. For example, if you know 100 original-ciphertexts, you can reverse all the remaining 10,000 ciphertexts.
   > 
   > **Losted data cannot be retrieved**. Cryptographic hash function can maintain a certain level of security even if the algorithm and secret key are leaked. Weakness: Due to the loss of data, the precision is reduced. Core principle: Multiple different inputs have the same output result, so the inverse result is not fixed.
   > 
   > The single-character digest algorithm can be configured to balance precision and security. The core idea of ​​the algorithm is as follows: Perform the AND operation on the character ascii code, and lose some binary bits (1 bit is lost, the original result is 1:2; and 2 bits are lost, is 1:4 etc.): For example: 1234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ with mask 0b0111_1111_1101, the result is 1014545890a`adedehihilmlmpqpqtutuxyxA@ADEDEHIHILMLMPQPQTUTUXYX
   > 
   > The original SQL like produces 1 record. After losing the precision, it may have 3 records. As the input characters increase, the precision improves, which is the same as the original SQL like semantics.
   > 
   > You can balance precision and security by adjusting the mask.
   > 
   > For Chinese character, if the algorithm and mask are leaked, one output can deduce multiple possible inputs, but excluding rare words, the possible inputs will be greatly reduced, resulting in reduced security. Therefore, we will make a dictionary of common words, no longer mask the unicode code, mask the “index” of the word in the dictionary. Therefore, leaked the algorithm, mask, and dictionary. one output can be reversed multiple common words to improve security, the code as follows:
   > 
   > ```
   > 
   >     // ascii 偏移量 防止出现原文
   >     private int delta = 1;
   >     // 二进制mask用于丢失精度
   >     private int mask = 0b1111_1011_1101;
   >     // 密文起始unicode
   >     private int start = 0x4e00;
   >     
   >     private String dict = "谤杉巫夏辅俯鸵直菱梨滨头矾讯芯巡泥簇何逊谜...";
   >     
   >     public static HashMap<Character, Integer> DICT_SEARCH = new HashMap<>(4000);
   >     
   >     
   >     @Override
   >     public String encrypt(final String plainValue, final EncryptContext encryptContext) {
   >         if (null == plainValue) {
   >             return null;
   >         }
   >         StringBuilder sb = new StringBuilder(plainValue.length());
   >         plainValue.chars().forEachOrdered(c -> {
   >             if ('%' == c) {
   >                 sb.append((char) c);
   >             } else {
   >                 int masked;
   >                 if (c > 256) {
   >                     Integer dictCode = DICT_SEARCH.get((char) c);
   >                     if (dictCode != null) {
   >                         c = dictCode;
   >                     }
   >                     masked = ((c + delta) & mask) + start;
   >                 } else {
   >                     masked = (c + delta) & mask;
   >                 }
   >                 if ('%' == masked) {
   >                     sb.append((char) c);
   >                 } else {
   >                     sb.append((char) masked);
   >                 }
   >             }
   >         });
   >         return sb.toString();
   >     }
   > ```
   > 
   > For more information, please refer to the attached PDF [隐私数据加密最佳实践.pdf](https://github.com/apache/shardingsphere/files/9683302/default.pdf). We have implemented the above features based on version 5.2.0. My coworker @gxxiong is merging it into the master branch, and will submit the PR later. Welcome to discuss this idea.
   
   Thanks for your comments,  and I think it's a implementable idea/solution for the `LIKE`, and we are almost matched.
   
   I think what I should say is: 
   In terms of decoupling, we could have 4 columns for the logicColumn at most: the plainColumn, the cipherColumn, the assistedQueryColumn and the fuzzyQueryColumn,   the cipherText at cipherColumn is for decryption purposes, the assistedQueryText at assistedQueryColumn is for `=` query, so the fuzzyQueryText at fuzzyQueryColumn is for `LIKE` query only. So, I think even if the fuzzy query algorithm which is specified by fuzzyQueryEncryptorName is very simple or not very safe, as long as it could desensitize the original plaintext, it's ok.
   
   Welcome to submit the PR, and I think there are below sub tasks:
   
   - [ ] add the fuzzyQueryColumn and fuzzyQueryEncryptorName
   - [ ] rewrite the sql while `LIKE` operator contained and UT, SIT
   - [ ] refactor the visit logic for distsql and UT, SIT
   - [ ] support the fuzzyQueryColumn and fuzzyQueryEncryptorName for distsql and UT, SIT
   - [ ] provider fuzzy query algorithms at least 1


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@shardingsphere.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org