You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2022/04/26 12:05:53 UTC
[GitHub] [incubator-doris] yiguolei opened a new issue, #9242: [Global Dict] Using global dict to accelerate low cardinality query performance

yiguolei opened a new issue, #9242:
URL: https://github.com/apache/incubator-doris/issues/9242

   ### Search before asking
   
   - [X] I had searched in the [issues](https://github.com/apache/incubator-doris/issues?q=is%3Aissue) and found no similar issues.
   
   
   ### Description
   
   Issue Description:
   
   In Doris, many dimension columns are string type with low cardinality (cardinality < 1024), such as some status codes, provinces, regions, genders, etc. During group by or join operation, it has to make a hash table in the form of strings. The efficiency is relatively low. Through performance testing, it is found that the performance is about 3 times worse than using int type. If we can convert string to int through a global dictionary during a query, then the query efficiency will be greatly improved. The solution is described below.
   
   Design:
   
   Maintain global dictionary for every column in FE as needed, for example it could be organized as map<(table_id,col_id), dict> in DictManager.
   When creating a table, set the attribute of the field to low_cardinality to identify whether the field is a low-cardinality field. For low-cardinality fields, use a global dictionary to speed up the corresponding query. The specific syntax is as follows:
   
   // create table
   create table table_name (
       col1 int,
       col2 string|varchar(10) low_cardinality
   );
   
   // alter table
   alter table table_name modify column col2 set low_cardinality=true
   
   At the beginning, the dict is empty, so it will not be used during query. FE will start a background task to obtain global dictionary information from BE for every low cardinality column, and the task is sent to BE in the form of SQL, as follows:
   
   select dict(col) from table[meta]
   
   So that there will be a new aggregation method dict and new scannode in BE. In the new scannode, BE reads the local dictionary of each field and get all dictinct strings. dict is an aggregation function that deduplicates and merges the dictionaries of multiple BEs and tablets. The final result is saved in FE's DictManager.
   
   During query, FE puts the global dictionary information into the Fragment structure and sends it to each BE. The structure is as follows:
   
   struct TStringColumnDict {
   list<int16> id,    // the offset or id of the string
   list<string> value    // the string value
   }
   struct TColumnDict {
   TStringColumnDict xxx,
   TDateColumnDict xxx,
   ....
   }
   struct TGlobalDict {
   map<id, TColumnDict> columnDicts // map from column id to column dict
   }
   struct TExecPlanFragmentParams {
   ...
   TGlobalDict globalDict
   }
   
   The OlapScanNode will encodes the string data according to the value and ID of each string in the dict, and the subsequent calculations (such as group by) are operated according to int, and at the final result sink node, the result is decode into the corresponding string.
   
   
   Maintain GlobalDict Changes
   
   During process of loading, user will import new strings that are not in the GlobalDict, so this requires an update mechanism for the global dictionary. Stream load is actually a query, so it will also generate a fragment, and according to the task description of our query, there is a global dict in the fragment. In TabletSink, each value of the column of low cardinality should be evaluated with ColumnDict, if find a new string, the coordinator need to inform FE that the dict of this column is invalid. Specifically, we need to make some modifications in LoadTxnCommitRequest and add a field to indicate which columns are invalid.
   
   struct TLoadTxnCommitRequest {
   ...
   list<int> invalid_col_ids
   }
   
   Invalid col ids should be persisted to the transaction state. Once the transaction is published, we will delete the dict from the dict manager. In this way, the query will not use this dictionary optimization in the future. At the same time, FE generates a task to get the latest Dict. When the latest Dict is obtained, the query can be optimized using Dict again.
   
   
   
   ### Use case
   
   _No response_
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org