Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2020/02/24 11:19:53 UTC

[GitHub] [incubator-doris] gaodayue edited a comment on issue #2953: [segment_v2] Switch to Unified and Extensible Page Format

URL: https://github.com/apache/incubator-doris/pull/2953#issuecomment-590273942
 
 
   @chaoyli answers inline
   
   > The bitmap index stores dict_column and bitmap across four pages, so every read costs four I/Os, which may be expensive. I think it would be better for dict_column to store a bitmap PagePointer, like this:
   > DictColumnValueIndex : val -> dict_column_page_pointer
   > DictColumnPage : dict_column|bitmap_page_pointer
   > BitmapPage : RoaringBitMap
   > which can save one I/O operation.
   
   Storing bitmap_page_pointer in DictColumnPage has several drawbacks:
   1. Binary search inside DictColumnPage becomes more complicated because we now need to separate `dict_column` from `bitmap_page_pointer`.
   2. Storage size is not necessarily reduced because we now need to store a bitmap page pointer for each dictionary item. Previously we only stored one bitmap page pointer per bitmap page.
   3. The implementation is more complicated because we now tightly couple DictColumn with BitmapColumn. We lose the benefits of the IndexedColumn abstraction.
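   
   To make drawback 2 concrete, here is a rough back-of-the-envelope sketch. The pointer size, the item count, and the number of bitmaps per page are made-up assumptions for illustration, not numbers measured from Doris.

```python
import math


def pointer_overhead_bytes(num_dict_items, items_per_bitmap_page, pointer_size=12):
    """Compare bytes spent on bitmap page pointers under the two layouts.

    All parameters are illustrative assumptions, not values taken from Doris.
    """
    # Layout proposed in the review: one pointer stored next to every dictionary item.
    per_item = num_dict_items * pointer_size
    # Layout in this PR: one pointer per bitmap page.
    num_bitmap_pages = math.ceil(num_dict_items / items_per_bitmap_page)
    per_page = num_bitmap_pages * pointer_size
    return per_item, per_page


per_item, per_page = pointer_overhead_bytes(num_dict_items=100_000, items_per_bitmap_page=50)
print(per_item, per_page)
```

   Whenever a bitmap page holds more than one bitmap, the per-item layout spends more bytes on pointers than the per-page layout.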
   
   > I think it may be better to keep the ZoneMap and OrdinalIndex reader/writer logic as it is.
   > Firstly, ZoneMap and OrdinalIndex are simple and may not need the complicated IndexedColumnWriter/Reader logic.
   > Secondly, IndexedColumnWriter will also contain every index writer if we add a BTree index in the future.
   > Thirdly, if we use the above optimization, ZoneMap and OrdinalIndex are also not suitable for it.
   
   The problem with implementing every kind of index from scratch instead of reusing existing abstractions is lower code reusability and higher long-term maintenance cost. The nice thing about `IndexedColumn` is that it can be used as the building block for all kinds of data and indexes, leading to a more layered system. Regarding your worries about the cost of using a BTree index for ZoneMap, I think IndexedColumn can support both single-level and multi-level indexes in the future.
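   
   As a toy illustration of how one abstraction could cover both cases, the sketch below adds upper index levels only when a level has more entries than a fanout limit, so a small column keeps a single-level index. The functions `build_index_levels` and `find_page` and the `fanout` parameter are hypothetical names for this example and are not the IndexedColumn API.

```python
import bisect


def build_index_levels(first_keys, fanout):
    """Build index levels bottom-up over the first key of each data page.

    levels[0] indexes the data pages directly. A new level is added only while
    the current top level has more than `fanout` entries.
    """
    levels = [list(first_keys)]
    while len(levels[-1]) > fanout:
        prev = levels[-1]
        # Each upper entry is the first key of a group of `fanout` lower entries.
        levels.append(prev[::fanout])
    return levels


def find_page(levels, fanout, key):
    """Return the position of the data page whose key range may contain `key`."""
    top = levels[-1]
    # Last top-level entry whose first key is <= key (clamped to the first entry).
    pos = max(bisect.bisect_right(top, key) - 1, 0)
    # Walk down, searching only the group of entries under the chosen parent.
    for level in reversed(levels[:-1]):
        lo = pos * fanout
        hi = min(lo + fanout, len(level))
        pos = max(bisect.bisect_right(level, key, lo, hi) - 1, lo)
    return pos


small = build_index_levels([0, 10, 20], fanout=4)        # stays single-level
large = build_index_levels(range(0, 100, 10), fanout=4)  # gets a second level
print(len(small), len(large), find_page(large, 4, 57))
```

   With a single level the lookup is one binary search, which is the cheap path a small index such as ZoneMap would take in this sketch.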
   
   > I found that the default encoding of VARCHAR and CHAR is dictionary encoding without any policy, which may waste space when cardinality is high. And if we want to change it, we would have to rebuild all of the data.
   
   Actually, the current implementation of `BinaryDictPageBuilder` falls back to plain encoding automatically when it finds that the cardinality is high and the dictionary page has grown too big.
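   
   The idea can be sketched roughly as follows. This is a simplified toy, not the actual `BinaryDictPageBuilder` code: the class name `ToyDictPageBuilder`, the `dict_size_limit` parameter, and the per-value fallback decision are assumptions made to keep the example short.

```python
class ToyDictPageBuilder:
    """Toy builder: dictionary-encode strings until the dictionary grows too large."""

    def __init__(self, dict_size_limit):
        self.dict_size_limit = dict_size_limit  # hypothetical byte budget for the dictionary
        self.codes = {}      # value -> dictionary code
        self.dict_bytes = 0  # total bytes of distinct values stored so far
        self.encoded = []    # one (encoding, payload) pair per added value

    def add(self, value):
        if value in self.codes:
            # Already in the dictionary: emit its code.
            self.encoded.append(("dict", self.codes[value]))
            return
        size = len(value.encode("utf-8"))
        if self.dict_bytes + size > self.dict_size_limit:
            # Dictionary budget exhausted: fall back to storing the value as-is.
            self.encoded.append(("plain", value))
            return
        # Still within budget: assign the next code and grow the dictionary.
        self.codes[value] = len(self.codes)
        self.dict_bytes += size
        self.encoded.append(("dict", self.codes[value]))


builder = ToyDictPageBuilder(dict_size_limit=6)
for v in ["aa", "bb", "aa", "cc", "dd", "bb"]:
    builder.add(v)
print(builder.encoded)
```

   Low-cardinality data stays dictionary-encoded, while new values that arrive after the budget is used up are stored plainly, so a high-cardinality column does not keep inflating the dictionary.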
   
