You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2022/05/30 03:33:48 UTC

[GitHub] [incubator-doris] starocean999 opened a new pull request, #9842: [Enhancement] global dict optimization for low cardinality string column

starocean999 opened a new pull request, #9842:
URL: https://github.com/apache/incubator-doris/pull/9842

   # Proposed changes
   
   Issue Number: close [#9838](https://github.com/apache/incubator-doris/issues/9838)
   
   ## Problem Summary:
   
   Group by operator is kind of time cosuming, especially for string column. However, some string column are low cardinality and can be dict encoded. We can take advantage of the dict encoded integer values in group by operator instead of the original strings to get much better performance.
   
   ## Checklist(Required)
   
   1. Does it affect the original behavior: (No)
   2. Has unit tests been added: (Yes)
   3. Has document been added or modified: (No)
   4. Does it need to update dependencies: (No)
   5. Are there any changes that cannot be rolled back: (No)
   
   ## Further comments
   
   add low_cardinality keyword. use "alter table xxx modify column yyy low_cardinality true;" to set the column yyy as low cardinality. The the column would be dict encoded and in group by operator, we can use integer dict code instead of strings.
   
   Use ssb_flat 100G as an example. We add low_cardinality key word for some string columns and run test sqls as bellow:
   
   alter table lineorder_flat modify column lo_orderpriority low_cardinality true;
   alter table lineorder_flat modify column lo_shipmode low_cardinality true;
   alter table lineorder_flat modify column lo_orderdate low_cardinality true;
   alter table lineorder_flat modify column s_city low_cardinality true;
   alter table lineorder_flat modify column c_city low_cardinality true;
   alter table lineorder_flat modify column s_nation low_cardinality true;
   alter table lineorder_flat modify column s_region low_cardinality true;
   alter table lineorder_flat modify column c_nation low_cardinality true;
   alter table lineorder_flat modify column p_category low_cardinality true;
   alter table lineorder_flat modify column p_mfgr low_cardinality true;
   
   alter table lineorder modify column lo_orderpriority low_cardinality true;
   alter table lineorder modify column lo_shipmode low_cardinality true;
   alter table lineorder modify column lo_orderdate low_cardinality true;
   alter table supplier modify column s_city low_cardinality true;
   alter table customer modify column c_city low_cardinality true;
   alter table supplier modify column s_nation low_cardinality true;
   alter table supplier modify column s_region low_cardinality true;
   alter table customer modify column c_nation low_cardinality true;
   alter table part modify column p_category low_cardinality true;
   alter table part modify column p_mfgr low_cardinality true;
   
   --Q1
   select count(*),lo_shipmode from lineorder_flat group by lo_shipmode;
   --Q2
   select count(distinct lo_shipmode) from lineorder_flat;
   --Q3
   select count(*),lo_shipmode,lo_orderpriority from lineorder_flat group by lo_shipmode,lo_orderpriority;
   --Q4
   select count(*),lo_shipmode,lo_orderpriority from lineorder_flat group by lo_shipmode,lo_orderpriority,lo_shippriority;
   --Q5
   select count(*),lo_shipmode,s_city from lineorder_flat group by lo_shipmode,s_city;
   --Q6
   select count(*) from lineorder_flat group by c_city,s_city;
   --Q7
   select count(*) from lineorder_flat group by lo_shipmode,lo_orderdate;
   --Q8
   select count(*) from lineorder_flat group by lo_orderdate,s_nation,s_region;
   --Q9
   select count(*) from lineorder_flat group by c_city,s_city,c_nation,s_nation;
   --Q10
   select count(*) from (select count(*) from lineorder_flat group by lo_shipmode,lo_orderpriority,p_category,s_nation,c_nation) t;
   --Q11
   select count(*) from (select count(*) from lineorder_flat group by lo_shipmode,lo_orderpriority,p_category,s_nation,c_nation,p_mfgr) t;
   --Q12
   select count(*) from (select count(*) from lineorder_flat group by substr(lo_shipmode,2),lower(lo_orderpriority),p_category,s_nation,c_nation,s_region,p_mfgr) t;
   
   The test results are(in ms):
   query | low_cardinality=false | low_cardinality=true
   -- | -- | --
   q1 | 707 | 367
   q2 | 694 | 343
   q3 | 1206 | 641
   q4 | 1384 | 727
   q5 | 1502 | 779
   q6 | 7218 | 1993
   q7 | 1337 | 753
   q8 | 3264 | 1711
   q9 | 12868 | 2703
   q10 | 17895 | 7611
   q11 | 20533 | 8654
   q12 | 24342 | 14983
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yiguolei commented on a diff in pull request #9842: [feature] global dict optimization for low cardinality string column

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #9842:
URL: https://github.com/apache/incubator-doris/pull/9842#discussion_r889982581


##########
be/src/runtime/descriptors.h:
##########
@@ -140,6 +146,8 @@ class SlotDescriptor {
     int _field_idx;
 
     const bool _is_materialized;
+    bool _is_global_dict_column = false;

Review Comment:
   rename it to  has_global_dict



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yiguolei commented on a diff in pull request #9842: [feature] global dict optimization for low cardinality string column

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #9842:
URL: https://github.com/apache/incubator-doris/pull/9842#discussion_r889961460


##########
be/src/olap/rowset/beta_rowset.cpp:
##########
@@ -230,4 +230,14 @@ bool BetaRowset::check_file_exist() {
     return true;
 }
 
+Status BetaRowset::get_dict_data(std::set<std::string>& dict_words, int col_id) {
+    std::vector<segment_v2::SegmentSharedPtr> segments;
+    RETURN_NOT_OK(load_segments(&segments));
+    for (auto& seg_ptr : segments) {
+        Status status = seg_ptr->get_dict_data(dict_words, col_id);
+        if (!status.ok()) return status;

Review Comment:
   use macro RETURN_NOT_OK(status)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yiguolei commented on a diff in pull request #9842: [Enhancement] global dict optimization for low cardinality string column

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #9842:
URL: https://github.com/apache/incubator-doris/pull/9842#discussion_r884409635


##########
gensrc/thrift/PaloInternalService.thrift:
##########
@@ -278,15 +278,17 @@ struct TTxnParams {
 }
 
 // Definition of global dict, global dict is used to accelerate query performance of low cardinality data
-struct TColumnDict {
-  1: optional Types.TPrimitiveType type
-  2: list<string> str_dict  // map one string to a integer, using offset as id
-}
+// struct TColumnDict {
+//   1: optional Types.TPrimitiveType type
+//   2: list<string> str_dict  // map one string to a integer, using offset as id
+// }
 
-struct TGlobalDict {
-  1: optional map<i32, TColumnDict> dicts,  // map dict_id to column dict
-  2: optional map<i32, i32> slot_dicts // map from slot id to column dict id, because 2 or more column may share the dict
-}
+// ExecPlanFragment
+

Review Comment:
   这种代码,如果删除了就删除好了,不要这种注释了,以后还得有人清理



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yiguolei commented on a diff in pull request #9842: [feature] global dict optimization for low cardinality string column

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #9842:
URL: https://github.com/apache/incubator-doris/pull/9842#discussion_r889959785


##########
be/src/olap/delta_writer.cpp:
##########
@@ -320,6 +321,41 @@ Status DeltaWriter::close_wait() {
         return res;
     }
 
+    if (invalid_dict_column_names) {
+        invalid_dict_column_names->clear();
+        std::vector<const TabletColumn*> dict_columns;
+        for (const auto& column : _tablet_schema->columns()) {
+            FieldType column_type = column.type();
+            if (_req.dicts != nullptr && (column_type == FieldType::OLAP_FIELD_TYPE_CHAR ||
+                                          column_type == FieldType::OLAP_FIELD_TYPE_VARCHAR ||
+                                          column_type == FieldType::OLAP_FIELD_TYPE_STRING)) {
+                auto iter = _req.dicts->find(column.name().data());
+                if (iter != _req.dicts->end()) {
+                    dict_columns.push_back(&column);
+                }
+            }
+        }
+
+        auto beta_rowset = dynamic_cast<BetaRowset*>(_cur_rowset.get());
+        assert(beta_rowset);
+        std::set<std::string> dict_words;
+        for (const TabletColumn* column : dict_columns) {
+            auto iter = _req.dicts->find(column->name().data());
+            assert(iter != _req.dicts->end());
+            dict_words.clear();
+            Status status = beta_rowset->get_dict_data(dict_words, column->unique_id());

Review Comment:
   If there are too many new dict words in column, get dict data may failed to get dict and return error status. If you return error status here, the load task is failed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yiguolei commented on a diff in pull request #9842: [feature] global dict optimization for low cardinality string column

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #9842:
URL: https://github.com/apache/incubator-doris/pull/9842#discussion_r889967266


##########
be/src/olap/rowset/segment_v2/column_reader.cpp:
##########
@@ -375,6 +375,79 @@ Status ColumnReader::new_iterator(ColumnIterator** iterator) {
     }
 }
 
+/**
+ * @brief check if all pages encoded by local dict
+ * if the last page is encoded by dict, all pages are encoded by dict
+ * 
+ * @return true if 
+ * @return false 
+ */
+bool ColumnReader::all_pages_encoded_by_dict(ColumnIteratorOptions iter_opts, BlockCompressionCodec* codec) {
+    // go to the last page
+    RETURN_IF_ERROR(_ensure_index_loaded());
+    OrdinalPageIndexIterator last_iter(_ordinal_index.get(), _ordinal_index->num_data_pages() - 1);
+    PageHandle handle;
+    Slice page_body;
+    PageFooterPB footer;
+    iter_opts.type = DATA_PAGE;
+    ParsedPage parsed_page;
+    RETURN_IF_ERROR(read_page(iter_opts, last_iter.page(), &handle, &page_body, &footer, codec));
+    // parse data page
+    RETURN_IF_ERROR(ParsedPage::create(std::move(handle), page_body, footer.data_page_footer(),

Review Comment:
   Are you sure this code could compiled? This line will return error status not boolean.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yiguolei commented on a diff in pull request #9842: [feature] global dict optimization for low cardinality string column

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #9842:
URL: https://github.com/apache/incubator-doris/pull/9842#discussion_r890057698


##########
be/src/vec/exec/volap_scan_node.h:
##########
@@ -48,6 +49,7 @@ class VOlapScanNode final : public OlapScanNode {
     Status _add_blocks(std::vector<Block*>& block);
     int _start_scanner_thread_task(RuntimeState* state, int block_per_scanner);
     Block* _alloc_block(bool& get_free_block);
+    Status _do_dict_encode(Block& block) const;

Review Comment:
   This method is useless?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] closed pull request #9842: [feature] global dict optimization for low cardinality string column

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #9842: [feature] global dict optimization for low cardinality string column
URL: https://github.com/apache/doris/pull/9842


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yiguolei commented on a diff in pull request #9842: [feature] global dict optimization for low cardinality string column

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #9842:
URL: https://github.com/apache/incubator-doris/pull/9842#discussion_r889975303


##########
be/src/olap/rowset/segment_v2/column_reader.cpp:
##########
@@ -375,6 +375,79 @@ Status ColumnReader::new_iterator(ColumnIterator** iterator) {
     }
 }
 
+/**
+ * @brief check if all pages encoded by local dict
+ * if the last page is encoded by dict, all pages are encoded by dict
+ * 
+ * @return true if 
+ * @return false 
+ */
+bool ColumnReader::all_pages_encoded_by_dict(ColumnIteratorOptions iter_opts, BlockCompressionCodec* codec) {
+    // go to the last page
+    RETURN_IF_ERROR(_ensure_index_loaded());
+    OrdinalPageIndexIterator last_iter(_ordinal_index.get(), _ordinal_index->num_data_pages() - 1);
+    PageHandle handle;
+    Slice page_body;
+    PageFooterPB footer;
+    iter_opts.type = DATA_PAGE;
+    ParsedPage parsed_page;
+    RETURN_IF_ERROR(read_page(iter_opts, last_iter.page(), &handle, &page_body, &footer, codec));
+    // parse data page
+    RETURN_IF_ERROR(ParsedPage::create(std::move(handle), page_body, footer.data_page_footer(),
+                                       encoding_info(), last_iter.page(), last_iter.page_index(),
+                                       &parsed_page));
+
+    auto dict_page_decoder = reinterpret_cast<BinaryDictPageDecoder*>(parsed_page.data_decoder);
+    bool all_dict = false;
+    if (dict_page_decoder) {
+        all_dict = dict_page_decoder->is_dict_encoding();
+    }
+    return all_dict;
+}
+
+Status ColumnReader::get_dict_data(std::set<string>& dict_words) {
+    if (encoding_info()->encoding() == DICT_ENCODING) {
+        std::unique_ptr<BlockCompressionCodec> compress_codec;
+        RETURN_IF_ERROR(get_block_compression_codec(_meta.compression(), compress_codec));
+        assert(compress_codec);
+        Slice dict_slice;
+        ColumnIteratorOptions iter_opts;
+        std::unique_ptr<fs::ReadableBlock> rblock;
+        fs::BlockManager* block_mgr = fs::fs_util::block_manager(_path_desc);
+        RETURN_IF_ERROR(block_mgr->open_block(_path_desc, &rblock));
+        iter_opts.rblock = rblock.get();
+        iter_opts.type = INDEX_PAGE;
+        iter_opts.stats = new OlapReaderStatistics();
+        if (all_pages_encoded_by_dict(iter_opts, compress_codec.get())) {
+            PageHandle dict_page_handle;
+            PageFooterPB dict_footer;
+            read_page(iter_opts, get_dict_page_pointer(), &dict_page_handle, &dict_slice,
+                      &dict_footer, compress_codec.get());
+            auto dict_decoder = std::make_unique<BinaryPlainPageDecoder>(dict_slice);
+            RETURN_IF_ERROR(dict_decoder->init());
+
+            auto* pd_decoder = (BinaryPlainPageDecoder*)dict_decoder.get();
+            StringRef* dict_word_info = new StringRef[pd_decoder->_num_elems];

Review Comment:
   using unique_ptr instead of raw ptr to avoid forgetting delete[]



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yiguolei commented on a diff in pull request #9842: [feature] global dict optimization for low cardinality string column

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #9842:
URL: https://github.com/apache/incubator-doris/pull/9842#discussion_r889960425


##########
be/src/olap/delta_writer.cpp:
##########
@@ -320,6 +321,41 @@ Status DeltaWriter::close_wait() {
         return res;
     }
 
+    if (invalid_dict_column_names) {
+        invalid_dict_column_names->clear();
+        std::vector<const TabletColumn*> dict_columns;
+        for (const auto& column : _tablet_schema->columns()) {
+            FieldType column_type = column.type();
+            if (_req.dicts != nullptr && (column_type == FieldType::OLAP_FIELD_TYPE_CHAR ||
+                                          column_type == FieldType::OLAP_FIELD_TYPE_VARCHAR ||
+                                          column_type == FieldType::OLAP_FIELD_TYPE_STRING)) {
+                auto iter = _req.dicts->find(column.name().data());
+                if (iter != _req.dicts->end()) {
+                    dict_columns.push_back(&column);
+                }
+            }
+        }
+
+        auto beta_rowset = dynamic_cast<BetaRowset*>(_cur_rowset.get());
+        assert(beta_rowset);
+        std::set<std::string> dict_words;
+        for (const TabletColumn* column : dict_columns) {
+            auto iter = _req.dicts->find(column->name().data());
+            assert(iter != _req.dicts->end());
+            dict_words.clear();
+            Status status = beta_rowset->get_dict_data(dict_words, column->unique_id());

Review Comment:
   I think get dict data could not affect normal load job. Maybe you could just put the column to invalid columns if failed to get dict data.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yiguolei commented on a diff in pull request #9842: [feature] global dict optimization for low cardinality string column

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #9842:
URL: https://github.com/apache/incubator-doris/pull/9842#discussion_r889996691


##########
be/src/olap/rowset/segment_v2/options.h:
##########
@@ -18,6 +18,7 @@
 #pragma once
 
 #include <cstddef>
+#include <parallel_hashmap/phmap.h>

Review Comment:
   useless change



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yiguolei commented on a diff in pull request #9842: [feature] global dict optimization for low cardinality string column

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #9842:
URL: https://github.com/apache/incubator-doris/pull/9842#discussion_r890049772


##########
be/src/runtime/runtime_state.cpp:
##########
@@ -467,4 +467,38 @@ int64_t RuntimeState::get_load_mem_limit() {
     }
 }
 
+void RuntimeState::set_global_dicts(const std::shared_ptr<TGlobalDict>& tglobal_dict) {
+    if (tglobal_dict) {
+        for (auto it = tglobal_dict->dicts.begin(); it != tglobal_dict->dicts.end(); it++) {
+            _dict_id_to_global_dict_map[it->first] =
+                    std::make_shared<vectorized::GlobalDict>(it->second.str_dict);
+        }
+        for (auto it = tglobal_dict->slot_dicts.begin(); it != tglobal_dict->slot_dicts.end();
+             it++) {
+            assert(_dict_id_to_global_dict_map.find(it->second) !=
+                   _dict_id_to_global_dict_map.end());
+            _slot_id_to_global_dict_map[it->first] = _dict_id_to_global_dict_map[it->second];
+        }
+    }
+}
+
+vectorized::GlobalDictSPtr RuntimeState::get_global_dict(int slot_id) {
+    assert(_slot_id_to_global_dict_map.find(slot_id) != _slot_id_to_global_dict_map.end());
+    return _slot_id_to_global_dict_map[slot_id];
+}
+
+vectorized::GlobalDictSPtr RuntimeState::find_global_dict(int slot_id) {

Review Comment:
   find global dict is duplicated with get_global_dict please merge these 2 method to 1



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yiguolei commented on a diff in pull request #9842: [Enhancement] global dict optimization for low cardinality string column

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #9842:
URL: https://github.com/apache/incubator-doris/pull/9842#discussion_r884410114


##########
fe/pom.xml:
##########
@@ -953,6 +954,11 @@ under the License.
                 <artifactId>mariadb-java-client</artifactId>
                 <version>${mariadb-java-client.version}</version>
             </dependency>
+			<dependency>
+	            <groupId>org.apache.arrow</groupId>

Review Comment:
   This is useless any more. Delete it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yiguolei commented on a diff in pull request #9842: [feature] global dict optimization for low cardinality string column

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #9842:
URL: https://github.com/apache/incubator-doris/pull/9842#discussion_r890053314


##########
be/src/runtime/tablets_channel.cpp:
##########
@@ -234,6 +250,7 @@ Status TabletsChannel::_open_all_writers(const PTabletWriterOpenRequest& request
         wrequest.tuple_desc = _tuple_desc;
         wrequest.slots = index_slots;

Review Comment:
   _dicts is constructed from slots, but wrequest already has a field slots so that I think the node could parse dict from wrequest's slots field. No need to create dicts here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yiguolei commented on a diff in pull request #9842: [feature] global dict optimization for low cardinality string column

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #9842:
URL: https://github.com/apache/incubator-doris/pull/9842#discussion_r890066425


##########
be/src/vec/columns/column_dict_encoded_string.h:
##########
@@ -0,0 +1,52 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "vec/columns/column_vector.h"
+#include "vec/common/assert_cast.h"
+
+using DictId = int32_t;
+
+namespace doris::vectorized {
+
+template <typename T>
+class ColumnDictEncodedString final : public ColumnVector<T> {

Review Comment:
   Is this class useful?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yiguolei commented on a diff in pull request #9842: [feature] global dict optimization for low cardinality string column

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #9842:
URL: https://github.com/apache/incubator-doris/pull/9842#discussion_r890057373


##########
be/src/vec/exec/volap_scan_node.cpp:
##########
@@ -34,6 +34,32 @@ VOlapScanNode::VOlapScanNode(ObjectPool* pool, const TPlanNode& tnode, const Des
     _free_blocks.reserve(_max_materialized_blocks);
 }
 
+Status VOlapScanNode::init(const TPlanNode& tnode, RuntimeState* state) {
+    assert(state);

Review Comment:
   Do not need to check here. If state is not set, then many method will crashed. We do not need to check every condition.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yiguolei commented on a diff in pull request #9842: [feature] global dict optimization for low cardinality string column

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #9842:
URL: https://github.com/apache/incubator-doris/pull/9842#discussion_r889979151


##########
be/src/olap/tablet.cpp:
##########
@@ -1531,4 +1532,24 @@ std::shared_ptr<MemTracker>& Tablet::get_compaction_mem_tracker(CompactionType c
     }
 }
 
+Status Tablet::get_dict_data(std::set<std::string>& dict_words, int col_id) {
+    std::vector<BetaRowset*> beta_row_sets;

Review Comment:
   Do not use raw ptr here. Because if you use raw ptr, the rowset maybe released before you use it. May core!!!!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yiguolei commented on a diff in pull request #9842: [feature] global dict optimization for low cardinality string column

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #9842:
URL: https://github.com/apache/incubator-doris/pull/9842#discussion_r889965334


##########
be/src/olap/rowset/beta_rowset_writer.cpp:
##########
@@ -320,10 +320,9 @@ Status BetaRowsetWriter::_create_segment_writer(
     }
 
     DCHECK(wblock != nullptr);

Review Comment:
   need not change this code, it does not matter whether it is a local variable or a member variable.



##########
be/src/olap/rowset/beta_rowset_writer.h:
##########
@@ -100,6 +101,8 @@ class BetaRowsetWriter : public RowsetWriter {
 
     bool _is_pending = false;
     bool _already_built = false;
+
+    segment_v2::SegmentWriterOptions _writer_options;

Review Comment:
   useless member variable.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #9842: [feature] global dict optimization for low cardinality string column

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #9842:
URL: https://github.com/apache/doris/pull/9842#issuecomment-1378060444

   We're closing this PR because it hasn't been updated in a while.
   This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and feel free a maintainer to remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yiguolei commented on a diff in pull request #9842: [feature] global dict optimization for low cardinality string column

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #9842:
URL: https://github.com/apache/incubator-doris/pull/9842#discussion_r889982882


##########
be/src/runtime/descriptors.cpp:
##########
@@ -90,7 +90,11 @@ void SlotDescriptor::to_protobuf(PSlotDescriptor* pslot) const {
 vectorized::MutableColumnPtr SlotDescriptor::get_empty_mutable_column() const {
     auto data_type = get_data_type_ptr();
     if (data_type) {
-        return data_type->create_column();
+        auto column_ptr = data_type->create_column();
+        if (is_global_dict_column()) {

Review Comment:
   rename to has_global_dict()



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yiguolei commented on a diff in pull request #9842: [feature] global dict optimization for low cardinality string column

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #9842:
URL: https://github.com/apache/incubator-doris/pull/9842#discussion_r890058030


##########
be/src/vec/exec/volap_scan_node.h:
##########
@@ -69,6 +71,8 @@ class VOlapScanNode final : public OlapScanNode {
     int _max_materialized_blocks;
 
     size_t _block_size = 0;
+
+    std::map<int, GlobalDictSPtr> _dicts;

Review Comment:
   Is this field useful?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yiguolei commented on a diff in pull request #9842: [feature] global dict optimization for low cardinality string column

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #9842:
URL: https://github.com/apache/incubator-doris/pull/9842#discussion_r890060186


##########
be/src/vec/runtime/dict/global_dict.cpp:
##########
@@ -0,0 +1,159 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "global_dict.h"
+
+#include "vec/columns/column_nullable.h"
+#include "vec/data_types/data_type_number.h"
+#include "vec/data_types/data_type_nullable.h"
+#include "vec/data_types/data_type_string.h"
+
+namespace doris::vectorized {
+
+GlobalDict::GlobalDict(const std::vector<std::string>& data) : Dict(data) {
+    _dict_data = data;
+    for (const auto& val : _dict_data) {
+        StringValue v {val.data(), (int)val.size()};
+        insert_value(v);
+    }
+}
+
+bool GlobalDict::encode(ColumnWithTypeAndName& col) {
+    // encode method is only used by unit test

Review Comment:
   Is encode method ever used?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yiguolei commented on a diff in pull request #9842: [feature] global dict optimization for low cardinality string column

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #9842:
URL: https://github.com/apache/incubator-doris/pull/9842#discussion_r890058605


##########
be/src/vec/runtime/dict/dict_mgr.h:
##########
@@ -0,0 +1,80 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file

Review Comment:
   This file is useless, please do not commit it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yiguolei commented on a diff in pull request #9842: [feature] global dict optimization for low cardinality string column

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #9842:
URL: https://github.com/apache/incubator-doris/pull/9842#discussion_r889983765


##########
be/src/runtime/runtime_state.cpp:
##########
@@ -467,4 +467,38 @@ int64_t RuntimeState::get_load_mem_limit() {
     }
 }
 
+void RuntimeState::set_global_dicts(const std::shared_ptr<TGlobalDict>& tglobal_dict) {
+    if (tglobal_dict) {
+        for (auto it = tglobal_dict->dicts.begin(); it != tglobal_dict->dicts.end(); it++) {
+            _dict_id_to_global_dict_map[it->first] =
+                    std::make_shared<vectorized::GlobalDict>(it->second.str_dict);
+        }
+        for (auto it = tglobal_dict->slot_dicts.begin(); it != tglobal_dict->slot_dicts.end();
+             it++) {
+            assert(_dict_id_to_global_dict_map.find(it->second) !=
+                   _dict_id_to_global_dict_map.end());
+            _slot_id_to_global_dict_map[it->first] = _dict_id_to_global_dict_map[it->second];
+        }
+    }
+}
+
+vectorized::GlobalDictSPtr RuntimeState::get_global_dict(int slot_id) {

Review Comment:
   GlobalDictsPtr



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yiguolei commented on a diff in pull request #9842: [feature] global dict optimization for low cardinality string column

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #9842:
URL: https://github.com/apache/incubator-doris/pull/9842#discussion_r889981500


##########
be/src/olap/tablet.cpp:
##########
@@ -1531,4 +1532,24 @@ std::shared_ptr<MemTracker>& Tablet::get_compaction_mem_tracker(CompactionType c
     }
 }
 
+Status Tablet::get_dict_data(std::set<std::string>& dict_words, int col_id) {
+    std::vector<BetaRowset*> beta_row_sets;
+    {
+        std::shared_lock rdlock(get_header_lock());
+        for (auto entry : _rs_version_map) {

Review Comment:
   std::shared_lock rdlock(get_header_lock());
           for (auto entry : _rs_version_map) {
          RETURN_IF_NOT_OK( rs->get_dict_data(dict_words, col_id));
               }
         



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org