You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2020/03/09 08:08:01 UTC
[GitHub] [incubator-doris] wangbo opened a new issue #3061: [Spark Load]
Doris Support Using Hive Table to Build Global Dict
wangbo opened a new issue #3061: [Spark Load] Doris Support Using Hive Table to Build Global Dict
URL: https://github.com/apache/incubator-doris/issues/3061
**Why Global Dict**
Doris has already support using ```roaring bitmap``` to store bitmap for query ```Precise Count Distinct Measure```.
But ```roaring bitmap``` just support ```int type```.
If we want to support ```non-int type column ``` 's ```Precise Count Distinct```,
We need a Global Dict to keep a map from ```non-int value``` and ```int value```.
```Global Dict``` is used to keep the map.
**Implementation details**
see pr #2940
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [incubator-doris] wangbo closed issue #3061: [Spark Load] Doris Support Using Hive Table to Build Global Dict
Posted by GitBox <gi...@apache.org>.
wangbo closed issue #3061:
URL: https://github.com/apache/incubator-doris/issues/3061
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [incubator-doris] wangbo edited a comment on issue #3061: [Spark Load] Doris Support Using Hive Table to Build Global Dict
Posted by GitBox <gi...@apache.org>.
wangbo edited a comment on issue #3061:
URL: https://github.com/apache/incubator-doris/issues/3061#issuecomment-642688333
**Why gloal dict?**
We use ```roaringbitmap``` to compute distinct value.
But ```roaringbitmap``` 's input value can only be integer.
So We need a data structure to store a map from origin value(can be any type) to integer value.
That is global dict.
**How to build global dict?**
For example:
We have a hive table A, named ```source_hive_tableA```, content as below
```
col1 bitmap_col_1
1 'a'
2 'b'
3 'c'
4 'd'
5 'd'
```
**step1 make a group by table, named group_by_table_A**
sql as below:
```
insert overwrite group_by_table_A as select bitmap_col_1
from source_hive_tableA group by bitmap_col_1
```
result as below:
```
bitamp_col_1
'a'
'b'
'c'
'd'
```
**step2 make gloabl dict table,named dict_table_A**
sql as below:
```
insert overwrite table dict_table_A as select bitmap_col_1,
value
from dict_table_A
union all select t1.bitamp_col_1,
(row_number() over(order by t1.bitamp_col_1)) + (0)
from (
select bitamp_col_1
from group_by_table_A
where
bitamp_col_1 is not null
)t1
left join (
select bitmap_col_1,
value
from dict_table_A
)t2
on t1.bitamp_col_1 = t2.bitmap_col_1
where t2.value is null
```
content as below:
```
bitmap_col_1 value
'a' 1
'b' 2
'c' 3
'd' 4
```
we use window function to complete encode new values
here (0) is the max value of ```dict_table_A```, for a empty ```dict_table_A ```, it is 0
**step3 use dict_table_A to encode source_hive_tableA**
```
insert overwrite source_hive_tableA
as select
source_hive_tableA.col1, t1.value
from
source_hive_tableA
left join(select bitamp_col_1,value from dict_table_A) t1
on source_hive_tableA.bitmap_col_1 = t1.bitamp_col_1
```
result as below
```
col1 bitmap_col_1
1 1
2 2
3 3
4 4
5 4
```
finally, the bitmap_col_1 is converted to integer type
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [incubator-doris] wangbo commented on issue #3061: [Spark Load] Doris Support Using Hive Table to Build Global Dict
Posted by GitBox <gi...@apache.org>.
wangbo commented on issue #3061:
URL: https://github.com/apache/incubator-doris/issues/3061#issuecomment-642688333
**Why gloal dict?**
We use ```roaringbitmap``` to compute distinct value.
But ```roaringbitmap``` 's input value can only be integer.
So We need a data structure to store a map from origin value(can be any type) to integer value.
That is global dict.
**How to build global dict?**
For example:
We have a hive table A, named ```source_hive_tableA```, content as below
```
col1 bitmap_col_1
1 'a'
2 'b'
3 'c'
4 'd'
5 'd'
```
**step1 make a group by table, named group_by_table_A**
sql as below:
```
insert overwrite group_by_table_A as select bitmap_col_1
from source_hive_tableA group by bitmap_col_1
```
result as below:
```
bitamp_col_1
'a'
'b'
'c'
'd'
```
**step2 make gloabl dict table,named dict_table_A**
sql as below:
```
insert overwrite table dict_table_A as select bitmap_col_1,
value
from dict_table_A
union all select t1.bitamp_col_1,
(row_number() over(order by t1.bitamp_col_1)) + (0)
from (
select bitamp_col_1
from group_by_table_A
where
bitamp_col_1 is not null
)t1
left join (
select bitmap_col_1,
value
from dict_table_A
)t2
on t1.bitamp_col_1 = t2.bitmap_col_1
where t2.value is null
```
content as below:
```
bitmap_col_1 value
'a' 1
'b' 2
'c' 3
'd' 4
```
we use window function to complete encode new values
here (0) is the max value of ```dict_table_A```, for a empty ```dict_table_A ```, it is 0
**step3 use dict_table_A to encode source_hive_tableA**
```
insert overwrite source_hive_tableA
as select
source_hive_tableA.col1, t1.value
from
source_hive_tableA
left join(select bitamp_col_1,value from dict_table_A) t1
on source_hive_tableA.bitmap_col_1 = t1.bitamp_col_1
```
result as below
```
col1 bitmap_col_1
1 1
2 2
3 3
4 4
5 4
```
finally, the bitmap_col_1 is converted to integer type
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org