Posted to gsoc@community.apache.org by "Maxim Solodovnik (Jira)" <ji...@apache.org> on 2024/02/02 10:52:00 UTC

[jira] [Updated] (GSOC-252) [GSoC][Doris]Dictionary encoding optimization

     [ https://issues.apache.org/jira/browse/GSOC-252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maxim Solodovnik updated GSOC-252:
----------------------------------
    Labels: Doris full-time gsoc2024  (was: full-time gsoc2024)

> [GSoC][Doris]Dictionary encoding optimization
> ---------------------------------------------
>
>                 Key: GSOC-252
>                 URL: https://issues.apache.org/jira/browse/GSOC-252
>             Project: Comdev GSOC
>          Issue Type: New Feature
>            Reporter: Calvin Kirs
>            Priority: Major
>              Labels: Doris, full-time, gsoc2024
>
> h2. Background
> Apache Doris is a modern data warehouse for real-time analytics.
> It delivers lightning-fast analytics on real-time data at scale.
> h2. Objectives
> Dictionary encoding optimization
> To save storage space, Doris dictionary-encodes string-type columns in the storage layer when their cardinality is relatively low. Dictionary encoding maps each distinct string value to an integer code through a dictionary. The column data is then stored as integers, and the dictionary is stored separately. When the data is read, the integer codes are converted back to their corresponding string values by looking them up in the dictionary.
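The mapping described above can be sketched in a few lines of C++. This is a minimal illustration only, not Doris code: the names `DictEncoded`, `dict_encode`, and `dict_decode` are hypothetical, and the in-memory layout (a vector of strings plus a vector of 32-bit codes) is an assumption chosen for clarity.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical illustration; names and layout are assumptions, not Doris APIs.
struct DictEncoded {
    std::vector<std::string> dict;     // code -> string, stored separately
    std::vector<std::uint32_t> codes;  // one integer code per row
};

// Map each distinct string to the next free integer code, in first-seen order.
DictEncoded dict_encode(const std::vector<std::string>& values) {
    DictEncoded out;
    std::unordered_map<std::string, std::uint32_t> index;  // string -> code
    out.codes.reserve(values.size());
    for (const std::string& v : values) {
        auto it = index.find(v);
        if (it == index.end()) {
            it = index.emplace(v, static_cast<std::uint32_t>(out.dict.size())).first;
            out.dict.push_back(v);
        }
        out.codes.push_back(it->second);
    }
    return out;
}

// Convert the integer codes back to strings by dictionary lookup.
std::vector<std::string> dict_decode(const DictEncoded& enc) {
    std::vector<std::string> out;
    out.reserve(enc.codes.size());
    for (std::uint32_t code : enc.codes) {
        assert(code < enc.dict.size());  // every code must refer to a dictionary entry
        out.push_back(enc.dict[code]);
    }
    return out;
}
```

With a low-cardinality input such as `{"cn", "us", "cn", "de", "us"}`, the dictionary holds three strings and each row costs one small integer, which is where the space saving comes from. When nearly every value is distinct, the dictionary approaches the size of the raw data and the codes become pure overhead.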
> The storage layer does not know whether a column has low or high cardinality when the data comes in. The current implementation therefore starts by dictionary-encoding the first page; if the dictionary becomes too large, the column is treated as high-cardinality and subsequent pages are not dictionary-encoded. However, even for high-cardinality columns the dictionary page is still retained, which saves no storage space and adds memory overhead during reading as well as extra CPU overhead during decoding.
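The fallback behaviour described above might look roughly like the following sketch. Everything here is hypothetical: `ColumnWriterSketch`, `Page`, `decode_page`, and the `max_dict_bytes` threshold are invented for illustration, and the choice to switch encodings only at page boundaries is an assumption, not a statement about how Doris actually decides.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is not the Doris implementation.
enum class PageEncoding { kDict, kPlain };

struct Page {
    PageEncoding encoding;
    std::vector<std::uint32_t> codes;  // filled when encoding == kDict
    std::vector<std::string> values;   // filled when encoding == kPlain
};

class ColumnWriterSketch {
public:
    explicit ColumnWriterSketch(std::size_t max_dict_bytes)
        : max_dict_bytes_(max_dict_bytes) {}

    // Assumption: the encoding is chosen once per page, at the page boundary.
    void append_page(const std::vector<std::string>& rows) {
        Page page;
        if (dict_bytes_ > max_dict_bytes_) {
            // Dictionary already too large: treat the column as high-cardinality.
            page.encoding = PageEncoding::kPlain;
            page.values = rows;
        } else {
            page.encoding = PageEncoding::kDict;
            for (const std::string& v : rows) {
                auto it = index_.find(v);
                if (it == index_.end()) {
                    it = index_.emplace(v, static_cast<std::uint32_t>(dict_.size())).first;
                    dict_.push_back(v);
                    dict_bytes_ += v.size();
                }
                page.codes.push_back(it->second);
            }
        }
        pages_.push_back(std::move(page));
    }

    const std::vector<Page>& pages() const { return pages_; }
    // The dictionary is kept after the fallback: earlier pages need it to decode.
    const std::vector<std::string>& dict() const { return dict_; }

private:
    std::size_t max_dict_bytes_;
    std::size_t dict_bytes_ = 0;
    std::unordered_map<std::string, std::uint32_t> index_;
    std::vector<std::string> dict_;
    std::vector<Page> pages_;
};

// Readers must handle both encodings, and need the dictionary for kDict pages.
std::vector<std::string> decode_page(const Page& page,
                                     const std::vector<std::string>& dict) {
    if (page.encoding == PageEncoding::kPlain) return page.values;
    std::vector<std::string> out;
    out.reserve(page.codes.size());
    for (std::uint32_t code : page.codes) {
        assert(code < dict.size());
        out.push_back(dict[code]);
    }
    return out;
}
```

In this sketch the dictionary built for the early pages stays alive for the whole column even though later pages never use it, which mirrors the overhead the issue describes for high-cardinality columns.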
> Optimizations can be made to reduce the memory and CPU overhead caused by dictionary encoding.
> h2. Recommended Skills
>  
> Familiar with C++ programming
> Familiar with the storage layer of Doris
>  
> h2. Mentor
>  
> Mentor: Xin Liao, Apache Doris Committer, liaoxinbit@gmail.com
> Mentor: YongQiang Yang, Apache Doris PMC Member, dataroaring@gmail.com
> Mailing List: dev@doris.apache.org
> Website: https://doris.apache.org
> Source Code: https://github.com/apache/doris
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: gsoc-unsubscribe@community.apache.org
For additional commands, e-mail: gsoc-help@community.apache.org