You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2020/12/24 10:09:27 UTC

[GitHub] [incubator-doris] chaoyli opened a new issue #5142: Optimize nested type implementation

chaoyli opened a new issue #5142:
URL: https://github.com/apache/incubator-doris/issues/5142


   Array is used to store data like user labels.
   made up of (offset, element), element can be an Array.
   [1, 2, 3], [4, 5, 6]
   Doris have implement array nest type. But it's not extensible.
   ```
   1.  It store NULL with offset, which only be feasible to Array, but not to Struct.
        Struct is made of {element 1, element2}, but not has implicit offset column.
        Array should store null as a new column, not attaches to offset column.
   2. It stores offset with absolute ordinal, so every offset ColumnWriter will have to 
       callback ArrayColumnWriter to record the position. The logic is so trick to understand.
   3. When read the ArrayColumn, it have to seek to next position to get the right end offset.
   ```
   So I think it to better to distinguish the memory layout and disk layout.
   ```
   [[1, 2], [3, 4]], [[5, 6, 7], null, [8]], [[9, 10]]
   
   * Offsets buffer (int32)
   
     | Bytes 0-3  | Bytes 4-7  | Bytes 8-11 | Bytes 12-15 | 
     |------------|------------|------------|-------------|
     | 0          |  2         |  5         |  6          |
   
     * Values array (`List<Int8>`)
   
       | Byte 0 (validity bitmap) | Bytes 1-63  |
       |--------------------------|-------------|
       | 00110111                 | 0 (padding) |
   
     * Offsets buffer (int32)
   
       | Bytes 0-27           | 
       |----------------------|
       | 0, 2, 4, 7, 7, 8, 10 |
   
       * Elements array (Int8):
   
         | Bytes 0-9                     |
         |-------------------------------|
         | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 |
   
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] sduzh commented on issue #5142: Optimize nested type implementation

Posted by GitBox <gi...@apache.org>.
sduzh commented on issue #5142:
URL: https://github.com/apache/incubator-doris/issues/5142#issuecomment-751181703


   > 目前的`get_type_info()`接口只支持一维数组,无法支持多维嵌套结构
   
   为了能够支持嵌套数组,以及struct/map等复杂类型,需要将当前的`get_type_info`内部实现修改为动态创建,并通过智能指针来管理生命周期
   ```
   std::unique_ptr<TypeInfo> get_type_info(const Field& field);
   std::unique_ptr<TypeInfo> get_type_info(const TabletColumn& column);
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] chaoyli commented on issue #5142: Optimize nested type implementation

Posted by GitBox <gi...@apache.org>.
chaoyli commented on issue #5142:
URL: https://github.com/apache/incubator-doris/issues/5142#issuecomment-750838601


   The disk layout should take encoding algorithm and reading efficiency into consideration.
   So the disk layout will store the size without the absolute for every array element.
   ```
   The disk layout
   * First Level Offsets (int32), First Level have no nulls
   
     | Bytes 0-3  | Bytes 4-7  | Bytes 8-11 |
     |------------|------------|------------|
     | 2         |  3         |  1        |
   
     * Second Nulls (uint8)
   
       | Byte 1- 3 | Bytes 4  | Bytes 5 - 7|
       |-----------------------------------|
       | 0         | 1        | 0          |
   
     * Second Level Offsets buffer (int32)
   
       | Bytes 0-23           |
       |----------------------|
       | 2, 2, 3, 0, 1, 2 |
   
     * Elements array (int32):
   
       | Bytes 0-9                     |
       |-------------------------------|
       | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 |
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] chaoyli edited a comment on issue #5142: Optimize nested type implementation

Posted by GitBox <gi...@apache.org>.
chaoyli edited a comment on issue #5142:
URL: https://github.com/apache/incubator-doris/issues/5142#issuecomment-750837917


   The array will support the **array_slice, flatten, a[5]** function.
   This function should to eliminate the memory copy in Array. 
   So the memory layout will store the data in bottom level. 
   Every function will return a new Array and only change set the offsets.
   ```
   The memory layout:
    
   Two-level data : [[1, 2], [3, 4]], [[5, 6, 7], null, [8]], [[9, 10]]
   
   * First Level Offsets (int32), First Level have no nulls
   
     | Bytes 0-3  | Bytes 4-7  | Bytes 8-11 | Bytes 12-15 |
     |------------|------------|------------|-------------|
     | 0          |  2         |  5         |  6          |
   
     * Second Nulls (uint8)
   
       | Byte 1- 3 | Bytes 4  | Bytes 5 - 7|
       |-----------------------------------|
       | 0         | 1        | 0          |
   
     * Second Level Offsets (int32)
   
       | Bytes 0-27           |
       |----------------------|
       | 0, 2, 4, 7, 7, 8, 10 |
   
     * Elements array (int32):
   
       | Bytes 0-9                     |
       |-------------------------------|
       | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 |
   ```
   Like the above, every level in array only stores nulls and offsets.
   And also, offsets will store zero to calculate the first size.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] chaoyli edited a comment on issue #5142: Optimize nested type implementation

Posted by GitBox <gi...@apache.org>.
chaoyli edited a comment on issue #5142:
URL: https://github.com/apache/incubator-doris/issues/5142#issuecomment-750837917


   The array will support the **array_contains, array_slice, flatten, a[5]** function.
   This function should to eliminate the memory copy in Array. 
   So the memory layout will store the data in bottom level.
   Every function will return a new Array.
   ```
   The memory layout:
    
   Two-level data : [[1, 2], [3, 4]], [[5, 6, 7], null, [8]], [[9, 10]]
   
   * First Level Offsets (int32), First Level have no nulls
   
     | Bytes 0-3  | Bytes 4-7  | Bytes 8-11 | Bytes 12-15 |
     |------------|------------|------------|-------------|
     | 0          |  2         |  5         |  6          |
   
     * Second Nulls (uint8)
   
       | Byte 1- 3 | Bytes 4  | Bytes 5 - 7|
       |-----------------------------------|
       | 0         | 1        | 0          |
   
     * Second Level Offsets (int32)
   
       | Bytes 0-27           |
       |----------------------|
       | 0, 2, 4, 7, 7, 8, 10 |
   
     * Elements array (int32):
   
       | Bytes 0-9                     |
       |-------------------------------|
       | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 |
   ```
   Like the above, every level in array only stores nulls and offsets.
   And also, offsets will store zero to calculate the first size.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] chaoyli commented on issue #5142: Optimize nested type implementation

Posted by GitBox <gi...@apache.org>.
chaoyli commented on issue #5142:
URL: https://github.com/apache/incubator-doris/issues/5142#issuecomment-750837917


   To better understand and reduce copy in memory for **array_contains, array_slice, flatten** function.
   It's better to distinguish the memory layout and disk layout.
   ```
   The memory layout:
    
   Two-level data : [[1, 2], [3, 4]], [[5, 6, 7], null, [8]], [[9, 10]]
   
   * First Level Offsets (int32), First Level have no nulls
   
     | Bytes 0-3  | Bytes 4-7  | Bytes 8-11 | Bytes 12-15 |
     |------------|------------|------------|-------------|
     | 0          |  2         |  5         |  6          |
   
     * Second Nulls (uint8)
   
       | Byte 1- 3 | Bytes 4  | Bytes 5 - 7|
       |-----------------------------------|
       | 0         | 1        | 0          |
   
     * Second Level Offsets (int32)
   
       | Bytes 0-27           |
       |----------------------|
       | 0, 2, 4, 7, 7, 8, 10 |
   
     * Elements array (int32):
   
       | Bytes 0-9                     |
       |-------------------------------|
       | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 |
   ```
   Like the above, every level in array only stores nulls and offsets.
   And also, offsets will store zero to calculate the first size.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] sduzh commented on issue #5142: Optimize nested type implementation

Posted by GitBox <gi...@apache.org>.
sduzh commented on issue #5142:
URL: https://github.com/apache/incubator-doris/issues/5142#issuecomment-751181271


   目前的`get_type_info()`接口只支持一维数组,无法支持多维嵌套结构


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] chaoyli commented on issue #5142: Optimize nested type implementation

Posted by GitBox <gi...@apache.org>.
chaoyli commented on issue #5142:
URL: https://github.com/apache/incubator-doris/issues/5142#issuecomment-752349268


   I divided into two function
   ```
   using TypeInfoPtr = std::shared_ptr<TypeInfo>
   const TypeInfoPtr& get_type_info(FieldType type)
   TypeInfoPtr get_type_info(const TabletColumn& column);
   ```
   The first interface is used to scalar type, and the second interface is used to all type including nested type(Array/Struct/Map).
   To alleviate the copy of shared_ptr, the scalar get_type_info interface will return a const reference of TypeInfo.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] chaoyli edited a comment on issue #5142: Optimize nested type implementation

Posted by GitBox <gi...@apache.org>.
chaoyli edited a comment on issue #5142:
URL: https://github.com/apache/incubator-doris/issues/5142#issuecomment-750838601


   The disk layout should take encoding algorithm and reading efficiency into consideration.
   So the disk layout will store the array_size without the absolute for every array element.
   ```
   The disk layout
   * First Level array_sizes (int32), First Level have no nulls
   
     | Bytes 0-3  | Bytes 4-7  | Bytes 8-11 |
     |------------|------------|------------|
     | 2         |  3         |  1        |
   
     * Second Nulls (uint8)
   
       | Byte 1- 3 | Bytes 4  | Bytes 5 - 7|
       |-----------------------------------|
       | 0         | 1        | 0          |
   
     * Second Level array_sizes (int32)
   
       | Bytes 0-23           |
       |----------------------|
       | 2, 2, 3, 0, 1, 2 |
   
     * Elements array (int32):
   
       | Bytes 0-9                     |
       |-------------------------------|
       | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 |
   ```
   For null, array_size, element array, it will construct the specified ColumnWriter to write the data.
   The ArrayColumnWriter only need to call the three writers.
   ```
   1. Call writer to write nulls.
   2. Call  writer to write array_sizes. And also add a new meta to record the corresponding relation between array_size ordinal 
        and element ordinal.
   3. Call element writer recursively.
   ``` 
   
   Upon read, when seek to specified the ordinal, it will seek the null, array_size, element separately.
   When seek the element column, it will get the start ordinal from array_size reader.
   Because the array_size have seek to specified ordinal, so It only needs one sum.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yangzhg closed issue #5142: Optimize nested type implementation

Posted by GitBox <gi...@apache.org>.
yangzhg closed issue #5142:
URL: https://github.com/apache/incubator-doris/issues/5142


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org