Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2020/02/12 01:39:01 UTC

[GitHub] [incubator-doris] liutang123 opened a new issue #2885: [Proposal] Restructure storage type to support complex types expending

liutang123 opened a new issue #2885: [Proposal] Restructure storage type to support complex types expending
URL: https://github.com/apache/incubator-doris/issues/2885
 
 
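   **Requirement**
   Implementing #2871 requires the storage layer to support arrays. The current data format cannot be extended cleanly to complex types, so the storage format needs to be restructured to make complex types easier to implement.
   
   **Goal**
   Modify the metadata and restructure the storage format so that data types are expressed as a tree, which makes complex types easy to add.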
   
   **Schema**
   Tablet Meta:
   ```
   message ColumnPB {
     required string type; // TPrimitiveType::type; LIST for multi-valued columns
     repeated ColumnPB childrenColumn;
     repeated string childrenColumnNames;
   }
   ```
   ```
   TabletColumn {
     _type: FieldType; // type
     TabletColumn* parent; // pointer to the parent column
     std::vector<std::unique_ptr<TabletColumn>> subTypes; // child column types
     std::vector<std::string> fieldNames; // child column names
     uint32_t getSubtypeCount() const = 0; // number of child columns
   }
   ```
   ```
   Field {
     TabletColumn
     _type_info: TypeInfo;
     _key_coder: KeyCoder;
     Field* parent; // pointer to the parent field
     std::vector<std::unique_ptr<Field>> subFields; // child fields
     std::vector<std::string> fieldNames; // child field names
     uint32_t getSubtypeCount() const = 0; // number of child fields
   }
   ```
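   Fields of different types are created through a FieldFactory.
   Tree-shaped TypeInfo: TypeInfo implements type-specific operations such as comparison and copy. For a complex type it needs to know the types of its child columns.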
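
   The tree-shaped type description can be illustrated with a minimal sketch. All names here (`SketchFieldType`, `TypeNode`, `make_scalar`, `make_list`, `describe`) are hypothetical and are not Doris APIs; the sketch only shows how a node that owns its children and keeps a parent pointer lets a type-level operation recurse into sub-types.

   ```cpp
   #include <cassert>
   #include <cstdint>
   #include <memory>
   #include <string>
   #include <utility>
   #include <vector>

   // Hypothetical stand-in for FieldType; only the values needed for the sketch.
   enum class SketchFieldType { INT, VARCHAR, LIST };

   // One node of the type tree: a scalar has no children, a LIST has exactly one.
   struct TypeNode {
       SketchFieldType type = SketchFieldType::INT;
       TypeNode* parent = nullptr;                       // non-owning back pointer
       std::vector<std::unique_ptr<TypeNode>> children;  // owned sub-types

       uint32_t subtype_count() const { return static_cast<uint32_t>(children.size()); }
   };

   // Factory function in the spirit of a FieldFactory: builds a scalar node.
   inline std::unique_ptr<TypeNode> make_scalar(SketchFieldType t) {
       auto node = std::make_unique<TypeNode>();
       node->type = t;
       return node;
   }

   // Builds LIST<element> and wires the child's parent pointer.
   inline std::unique_ptr<TypeNode> make_list(std::unique_ptr<TypeNode> element) {
       auto node = std::make_unique<TypeNode>();
       node->type = SketchFieldType::LIST;
       element->parent = node.get();
       node->children.push_back(std::move(element));
       return node;
   }

   // A type-level operation that must know the sub-types: renders the tree.
   inline std::string describe(const TypeNode& n) {
       switch (n.type) {
           case SketchFieldType::INT: return "INT";
           case SketchFieldType::VARCHAR: return "VARCHAR";
           case SketchFieldType::LIST: return "LIST<" + describe(*n.children[0]) + ">";
       }
       return "UNKNOWN";
   }
   ```

   Comparison or copy for a complex type would recurse through `children` in the same way `describe` does.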
   
   segment footer:
   ```
   message ColumnMetaPB {
     required int32 type; // FieldType
     repeated ColumnMetaPB childrenColumn; // child types
     repeated string childrenColumnNames;
   }
   ```
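   **Writing data**
   Tree-shaped ColumnWriter:
   Each kind of Field has a corresponding ColumnWriter.
   The encoding to use is determined by the ColumnWriter's _page_builder.
   The Field held by a ColumnWriter becomes a tree.
   The Data Region of a page currently stores the following: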
   ```
                    +----------------+
                    |  first row id  |
                    |----------------|
                    |   value count  |
                    |----------------|
                    | bitmap length  |
                    |----------------|
                    |  null bitmap   |
                    |----------------|
                    |     data       |
                    |----------------|
                    |    checksum    |
                    +----------------+
   ```
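   A list type can be written by reusing the PageBuilder of the Long type.
   
   [1,2,3,4],[5,6,7],[],[8]
   
   The page stores the following data:
   
   offsets: RLEv2[0 4 7 7] // Array lengths, recorded as start offsets. Only the start offset of [8] is stored, so its length cannot be derived. The Page format therefore needs to be changed so that it can carry extensible extra information (incompatibility 1).
   The element data is stored in the pages of the child column, i.e. the parent column and the child column occupy separate Data Regions.
   Because a child column can be longer than its parent column, ordinals in the ordinal index need to be uint64, unlike the previous rowid_t (incompatibility 2).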
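
   The offsets layout for the example rows can be sketched as follows. This is an illustration only: `EncodedListColumn`, `encode_lists` and `list_length` are hypothetical names, RLEv2 encoding is omitted, and using a total element count as the extra page-level information is an assumption, since the proposal does not say what that information would be.

   ```cpp
   #include <cassert>
   #include <cstddef>
   #include <cstdint>
   #include <vector>

   // Flattened form of a list column: one start offset per row plus all elements.
   struct EncodedListColumn {
       std::vector<uint64_t> offsets;  // parent column: start offset of each row
       std::vector<int64_t> values;    // child column: all elements, concatenated
       uint64_t total_elements = 0;    // assumed extra information kept with the page
   };

   // Splits rows into parent offsets and child values.
   inline EncodedListColumn encode_lists(const std::vector<std::vector<int64_t>>& rows) {
       EncodedListColumn out;
       for (const auto& row : rows) {
           out.offsets.push_back(out.values.size());
           out.values.insert(out.values.end(), row.begin(), row.end());
       }
       out.total_elements = out.values.size();
       return out;
   }

   // Length of row i: the next row's start offset, or the total for the last row.
   inline uint64_t list_length(const EncodedListColumn& col, std::size_t i) {
       const uint64_t end =
           (i + 1 < col.offsets.size()) ? col.offsets[i + 1] : col.total_elements;
       return end - col.offsets[i];
   }
   ```

   Without `total_elements` the last row has no end offset, which is the gap the proposal calls incompatibility 1. The child column here holds 8 values for 4 parent rows, which is why child ordinals can exceed the parent's row ids.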
   
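   **Reading data**
   As with writing, ColumnReader is abstracted (supported on segment_v2 only), and a different ColumnReader is created for each type.
   The current ColumnReader holds a ColumnMetaPB attribute; ColumnMetaPB becomes a tree.
   
   Taking LIST as an example,
   a ListColumnReader contains one child ColumnReader.
   Each read places data in a RowBlockV2. Different ColumnReaders produce different FileColumnIterator subclasses, each reading data in its own way.
   RowBlockV2 has the following problems:
   1. The length of a RowBlockV2 is currently fixed (1024 by default).
   2. Each column is just an array, so it cannot hold child column data and therefore cannot read complex types.
   Solution: wrap _column_datas and _column_null_bitmaps in RowBlockV2 into a ColumnVectorBatch similar to ORC's.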
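
   A rough sketch of what such a batch hierarchy might look like, loosely modeled on the idea of ORC's vector batches. The types below (`SketchColumnVectorBatch`, `SketchInt64Batch`, `SketchListBatch`) are hypothetical and are neither the ORC nor the Doris API; they only show how a list batch can own a child batch and grow beyond a fixed row count.

   ```cpp
   #include <cassert>
   #include <cstddef>
   #include <cstdint>
   #include <memory>
   #include <utility>
   #include <vector>

   // Base batch: one null flag per row, growable instead of fixed at 1024 rows.
   struct SketchColumnVectorBatch {
       std::vector<uint8_t> is_null;  // 1 = null, 0 = not null
       virtual ~SketchColumnVectorBatch() = default;
       std::size_t size() const { return is_null.size(); }
   };

   // Scalar batch holding 64-bit integers.
   struct SketchInt64Batch : SketchColumnVectorBatch {
       std::vector<int64_t> data;
       void append(int64_t v) {
           data.push_back(v);
           is_null.push_back(0);
       }
   };

   // List batch: offsets into a child batch that holds the elements.
   struct SketchListBatch : SketchColumnVectorBatch {
       std::vector<uint64_t> offsets{0};  // size() + 1 entries; row i spans [offsets[i], offsets[i+1])
       std::unique_ptr<SketchColumnVectorBatch> elements;

       explicit SketchListBatch(std::unique_ptr<SketchColumnVectorBatch> child)
           : elements(std::move(child)) {}

       // Closes the current row; call after appending that row's elements to the child.
       void end_row() {
           offsets.push_back(elements->size());
           is_null.push_back(0);
       }
   };
   ```

   Keeping one more offset than there are rows means every row, including the last, has an explicit end position.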
   
   After a RowBlockV2 is read, it has to be converted to a RowBlock. The C++ type stored in the RowBlock is as follows:
   ```
   struct collection {
   	// array length
   	size_t length;
   	// null bitmap
   	uint8_t* null_bitmap_data;
   	// element data (AnyVal values)
   	void* data;
   };
   ```
   A collection has a fixed size, so it can simply be placed in a slot of the RowBlock.
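
   To illustrate why the fixed size matters, here is a small sketch that copies such a descriptor into and out of a byte offset in a row buffer. `SketchCollection`, `write_slot` and `read_slot` are hypothetical names, and the row layout is an assumption made for the example, not the actual RowBlock layout.

   ```cpp
   #include <cassert>
   #include <cstddef>
   #include <cstdint>
   #include <cstring>
   #include <type_traits>
   #include <vector>

   // Mirrors the fields of the proposed struct; the elements live outside the row.
   struct SketchCollection {
       std::size_t length;         // number of elements
       uint8_t* null_bitmap_data;  // null bitmap for the elements
       void* data;                 // pointer to the element data
   };

   // The descriptor is plain data, so a byte-wise copy into a slot is valid.
   static_assert(std::is_trivially_copyable<SketchCollection>::value,
                 "descriptor must be copyable with memcpy");

   // Copies the descriptor into the row buffer at the given byte offset.
   inline void write_slot(uint8_t* row, std::size_t slot_offset, const SketchCollection& c) {
       std::memcpy(row + slot_offset, &c, sizeof(SketchCollection));
   }

   // Reads the descriptor back from the row buffer.
   inline SketchCollection read_slot(const uint8_t* row, std::size_t slot_offset) {
       SketchCollection c;
       std::memcpy(&c, row + slot_offset, sizeof(SketchCollection));
       return c;
   }
   ```

   Every slot takes `sizeof(SketchCollection)` bytes regardless of how many elements the array holds, because only pointers and a length are stored in the row.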

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] liutang123 commented on issue #2885: [Proposal] Restructure storage type to support complex types expending

Posted by GitBox <gi...@apache.org>.
liutang123 commented on issue #2885: [Proposal] Restructure storage type to support complex types expending
URL: https://github.com/apache/incubator-doris/issues/2885#issuecomment-584996542
 
 
   cc @Seaven @kangpinghuang @imay @gaodayue 
