You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2020/03/09 08:08:01 UTC

[GitHub] [incubator-doris] wangbo opened a new issue #3061: [Spark Load] Doris Support Using Hive Table to Build Global Dict

wangbo opened a new issue #3061: [Spark Load] Doris Support Using Hive Table to Build Global Dict
URL: https://github.com/apache/incubator-doris/issues/3061
 
 
   **Why Global Dict**
   Doris has already support using ```roaring bitmap``` to store bitmap for query ```Precise Count Distinct Measure```.
   But ```roaring bitmap```  just support ```int type```.
   If we want to support ```non-int type column ``` 's ```Precise Count Distinct```,
   We need a Global Dict to keep a map from ```non-int value``` and ```int value```.
   ```Global Dict```  is used to keep the map.
   
   **Implementation details**
   see pr #2940 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] wangbo closed issue #3061: [Spark Load] Doris Support Using Hive Table to Build Global Dict

Posted by GitBox <gi...@apache.org>.
wangbo closed issue #3061:
URL: https://github.com/apache/incubator-doris/issues/3061


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] wangbo edited a comment on issue #3061: [Spark Load] Doris Support Using Hive Table to Build Global Dict

Posted by GitBox <gi...@apache.org>.
wangbo edited a comment on issue #3061:
URL: https://github.com/apache/incubator-doris/issues/3061#issuecomment-642688333


   **Why gloal dict?**
   	We use ```roaringbitmap``` to compute distinct value.
   	But ```roaringbitmap```  's input value can only be integer.
   	So We need a data structure to store a map from origin value(can be any type) to integer value.
   	That is global dict.
   
   **How to build global dict?**
   For example:
   We have a hive table A, named ```source_hive_tableA```, content as below
   ```
   col1  bitmap_col_1
   1       'a' 
   2       'b'
   3       'c'
   4       'd'
   5       'd'
   ```
   
   **step1  make a group by table, named group_by_table_A**
   sql as below:
   ```
   insert overwrite group_by_table_A as select bitmap_col_1
    from source_hive_tableA group by bitmap_col_1
   ```
   result as below:
   ```
   bitamp_col_1
   'a'
   'b'
   'c'
   'd'
   ```
   
   **step2 make gloabl dict table,named dict_table_A**
   sql as below:
   ```
   insert overwrite table dict_table_A as select bitmap_col_1,
          value
     from dict_table_A
   union all select t1.bitamp_col_1,
          (row_number() over(order by t1.bitamp_col_1)) + (0)
     from (
           select bitamp_col_1
             from group_by_table_A
            where
              bitamp_col_1 is not null
          )t1
     left join (
           select bitmap_col_1,
                  value
             from dict_table_A
          )t2
       on t1.bitamp_col_1 = t2.bitmap_col_1
    where t2.value is null
   ```
   
   content as below:
   ```
   bitmap_col_1	        value
   'a'				 1
   'b'				 2
   'c'				 3
   'd'				 4
   ```
   we use window function to complete encode new values
   here (0) is the max value of ```dict_table_A```, for a empty ```dict_table_A ```, it is 0
   
   **step3 use dict_table_A to encode source_hive_tableA**
   ```
   insert overwrite source_hive_tableA
   as select 
   source_hive_tableA.col1, t1.value
   from
   source_hive_tableA
   left join(select bitamp_col_1,value from dict_table_A) t1
   on source_hive_tableA.bitmap_col_1 = t1.bitamp_col_1
   ```
   result as below
   ```
   col1          bitmap_col_1
   1			 1
   2			 2
   3			 3
   4			 4
   5			 4
   ```
   finally, the bitmap_col_1 is converted to integer type


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] wangbo commented on issue #3061: [Spark Load] Doris Support Using Hive Table to Build Global Dict

Posted by GitBox <gi...@apache.org>.
wangbo commented on issue #3061:
URL: https://github.com/apache/incubator-doris/issues/3061#issuecomment-642688333


   **Why gloal dict?**
   	We use ```roaringbitmap``` to compute distinct value.
   	But ```roaringbitmap```  's input value can only be integer.
   	So We need a data structure to store a map from origin value(can be any type) to integer value.
   	That is global dict.
   
   **How to build global dict?**
   For example:
   We have a hive table A, named ```source_hive_tableA```, content as below
   ```
   col1  bitmap_col_1
   1       'a' 
   2       'b'
   3       'c'
   4       'd'
   5       'd'
   ```
   
   **step1  make a group by table, named group_by_table_A**
   sql as below:
   ```
   insert overwrite group_by_table_A as select bitmap_col_1
    from source_hive_tableA group by bitmap_col_1
   ```
   result as below:
   ```
   bitamp_col_1
   'a'
   'b'
   'c'
   'd'
   ```
   
   **step2 make gloabl dict table,named dict_table_A**
   sql as below:
   ```
   insert overwrite table dict_table_A as select bitmap_col_1,
          value
     from dict_table_A
   union all select t1.bitamp_col_1,
          (row_number() over(order by t1.bitamp_col_1)) + (0)
     from (
           select bitamp_col_1
             from group_by_table_A
            where
              bitamp_col_1 is not null
          )t1
     left join (
           select bitmap_col_1,
                  value
             from dict_table_A
          )t2
       on t1.bitamp_col_1 = t2.bitmap_col_1
    where t2.value is null
   ```
   
   content as below:
   ```
   bitmap_col_1	value
   'a'				 1
   'b'				 2
   'c'				 3
   'd'				 4
   ```
   we use window function to complete encode new values
   here (0) is the max value of ```dict_table_A```, for a empty ```dict_table_A ```, it is 0
   
   **step3 use dict_table_A to encode source_hive_tableA**
   ```
   insert overwrite source_hive_tableA
   as select 
   source_hive_tableA.col1, t1.value
   from
   source_hive_tableA
   left join(select bitamp_col_1,value from dict_table_A) t1
   on source_hive_tableA.bitmap_col_1 = t1.bitamp_col_1
   ```
   result as below
   ```
   col1      bitmap_col_1
   1			 1
   2			 2
   3			 3
   4			 4
   5			 4
   ```
   finally, the bitmap_col_1 is converted to integer type


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org