You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2021/08/03 13:43:40 UTC

[GitHub] [incubator-doris] EmmyMiao87 opened a new issue #6370: Collect richer statistical information

EmmyMiao87 opened a new issue #6370:
URL: https://github.com/apache/incubator-doris/issues/6370


   # Background
   Doris's current statistics are only inaccurate table rows and table size. Due to the lack of some more accurate column statistics, the Doris query plan cannot evaluate and select a better plan. Also the current statistical information is not a complete structure.
   
   I propose to add a statistical information collection module in the query layer. Mainly responsible for the collection, storage, use and other functions of statistical information, so as to assist the optimizer to obtain better query planning.
   
   # Requirements
   1. Statistics include table stat (row count, data size), column stat (ndv, min, max, null_nums etc)
   2. Show statistics
   3. Collect statistics (sample, all)
   4. Storage statistics (memory and persistence)
   5. Up-to-date or out-to-date
   6. Compatible with old statistical logic
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] wangbo commented on issue #6370: Collect richer statistical information

Posted by GitBox <gi...@apache.org>.
wangbo commented on issue #6370:
URL: https://github.com/apache/incubator-doris/issues/6370#issuecomment-895160184


   I have some questions about detailed design.
   1 Currently, how much column statistics we need to collect in total? In addition to the column listed above.
   2 Is current design scalability? For example, I want to add a new column statistic, named ```avg_char_len```, how to make it a lightweight operation. It seems that we need to do a schema change in current design.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] EmmyMiao87 commented on issue #6370: Collect richer statistical information

Posted by GitBox <gi...@apache.org>.
EmmyMiao87 commented on issue #6370:
URL: https://github.com/apache/incubator-doris/issues/6370#issuecomment-891862112


   # Collect Statistics
   
   ## grammar
   
   In the first stage, only consider using statements to manually trigger collection tasks. The user triggers the collection task through ``` ANALYZE statement ```, and collects the statistical information of the specified table or column according to the provided parameters. The collected statistics are stored in Doris.
   
   ```sql
   ANALYZE [ table_and_columns [, ...] ]
   
   table_and_columns is:
       table_name [ ( column_name [, ...] ) ]
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] EmmyMiao87 edited a comment on issue #6370: Collect richer statistical information

Posted by GitBox <gi...@apache.org>.
EmmyMiao87 edited a comment on issue #6370:
URL: https://github.com/apache/incubator-doris/issues/6370#issuecomment-891860595


   # Show statistics
   
   + Show table statistics
   
       If table_name is filled in, the information of a single table will be displayed; if not, the information of all tables in the entire library will be displayed.	
   
     ```sql
     show table stats {db_name.table_name };
     ```
   
   	| table_name | row_count | data_size |
   	| :--------- | --------- | --------- |
   	| store      | 100000    | 1MB       |
   
   + Show column statistics
   
     ```sql
     show column stats table_name;
     ```
   
     | column_name | num_distinct_value | min  | max  | num_nulls | avg_col_len | max_col_len |
     | ----------- | ------------------ | ---- | ---- | --------- | ----------- | --- |
     | s_store_sk  | 10                 | 1    | 100  | 0         | 4           |4|
     | s_store_id  | -1                 | -1   | -1   | -1        | 4           |4|
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] EmmyMiao87 commented on issue #6370: Collect richer statistical information

Posted by GitBox <gi...@apache.org>.
EmmyMiao87 commented on issue #6370:
URL: https://github.com/apache/incubator-doris/issues/6370#issuecomment-896046939


   > 1 Currently, how much column statistics we need to collect in total? In addition to the column listed above.
   
   In the first stage, we only collect these column statistics listed above. Is there anything you want to add?
   
   > 2 Does the current design have good scalability? For example, I want to add a new column statistic, named `avg_char_len`, how to make it a lightweight operation. It seems that we need to do a schema change in current design.
   
   Although the column statistical information listed here is in the form of a table structure, it is not actually a real table. So there is no need for schema change
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] EmmyMiao87 edited a comment on issue #6370: Collect richer statistical information

Posted by GitBox <gi...@apache.org>.
EmmyMiao87 edited a comment on issue #6370:
URL: https://github.com/apache/incubator-doris/issues/6370#issuecomment-891864215


   # Manually inject statistics
   
   ## grammar
   
   The statistical values that can be injected manually include: row_count, data_size etc。
   
   1. Manually inject statistics into the table
   
   	```sql
   	ALTER TABLE table_name set stats ("key" = "value");
   	```
   	
   	For example: manually set the number of rows in the table store to 10000
   	```sql
   	ALTER TABLE store set stats ("row_count" = "10000");
   	```
   	After the update, you can see that the row count of the table store has been updated to 10000 by ``` show table stats store```.
   	| table_name | row_count | data_size |
   	| :--------- | --------- | --------- |
   	| store      | **10000** | 1MB       |
   
   2. Manually inject column statistics
       ```sql
       ALTER TABLE table_name MODIFY COLUMN column_name set stats ("key" = "value");
       ```
   
       For example: manually set the num_distinct_value of the column s_store_sk of the table store to 5.
   
       ```sql
       ALTER TABLE store modify column s_store_sk set stats ("num_distinct_value" = "5");
       ```
   
       After the update, you can see that the num_distinct_value value of the column s_store_id of the table store is updated to 5.
   
       | column_name | num_distinct_value | min  | max  | num_nulls | avg_col_len | max_col_len |
       | ----------- | ------------------ | ---- | ---- | --------- | ----------- | ----------- |
       | s_store_sk  | 10                 | 1    | 100  | 0         | 4           | 4           |
       | s_store_id  | **5**              | -1   | -1   | -1        | 4           | 4           |
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] EmmyMiao87 commented on issue #6370: [Proposal] Collect richer statistical information

Posted by GitBox <gi...@apache.org>.
EmmyMiao87 commented on issue #6370:
URL: https://github.com/apache/incubator-doris/issues/6370#issuecomment-977577440


   抽样库 https://datasketches.apache.org/


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] wangbo edited a comment on issue #6370: Collect richer statistical information

Posted by GitBox <gi...@apache.org>.
wangbo edited a comment on issue #6370:
URL: https://github.com/apache/incubator-doris/issues/6370#issuecomment-895160184


   I have some questions about detailed design:
   1 Currently, how much column statistics we need to collect in total? In addition to the column listed above.
   2 Does the current design have good scalability? For example, I want to add a new column statistic, named ```avg_char_len```, how to make it a lightweight operation. It seems that we need to do a schema change in current design.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] EmmyMiao87 commented on issue #6370: Collect richer statistical information

Posted by GitBox <gi...@apache.org>.
EmmyMiao87 commented on issue #6370:
URL: https://github.com/apache/incubator-doris/issues/6370#issuecomment-891860595


   # Show statistics
   
   + 展示表统计信息
   
     如果填写 table_name 则展示单表信息,如果不填则展示整个库的所有表的信息。	
   
     ```sql
     show table stats {db_name.table_name };
     ```
   
   	| table_name | row_count | data_size |
   	| :--------- | --------- | --------- |
   	| store      | 100000    | 1MB       |
   
   + 展示列统计信息
   
     ```sql
     show column stats table_name;
     ```
   
     | column_name | num_distinct_value | min  | max  | num_nulls | avg_col_len | max_col_len |
     | ----------- | ------------------ | ---- | ---- | --------- | ----------- | --- |
     | s_store_sk  | 10                 | 1    | 100  | 0         | 4           |4|
     | s_store_id  | -1                 | -1   | -1   | -1        | 4           |4|
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] EmmyMiao87 commented on issue #6370: Collect richer statistical information

Posted by GitBox <gi...@apache.org>.
EmmyMiao87 commented on issue #6370:
URL: https://github.com/apache/incubator-doris/issues/6370#issuecomment-891864215


   # Manually inject statistics
   
   ## grammar
   
   The statistical values that can be injected manually include: row_count, data_size etc。
   
   1. Manually inject statistics into the table
   
   	```sql
   	ALTER TABLE table_name set stats ("key" = "value");
   	```
   	
   	For example: manually set the number of rows in the table store to 10000
   	```sql
   	ALTER TABLE store set stats ("row_count" = "10000");
   	```
   	After the update, you can see that the row count of the table store has been updated to 10000 by ``` show table stats store```.
   	| table_name | row_count | data_size |
   	| :--------- | --------- | --------- |
   	| store      | **10000** | 1MB       |
   
   2. Manually inject column statistics
   
     ```sql
     ALTER TABLE table_name MODIFY COLUMN column_name set stats ("key" = "value");
     ```
   
     For example: manually set the num_distinct_value of the column s_store_sk of the table store to 5.
   
     ```sql
     ALTER TABLE store modify column s_store_sk set stats ("num_distinct_value" = "5");
     ```
   
     After the update, you can see that the num_distinct_value value of the column s_store_id of the table store is updated to 5.
   
   | column_name | num_distinct_value | min  | max  | num_nulls | avg_col_len | max_col_len |
   | ----------- | ------------------ | ---- | ---- | --------- | ----------- | ----------- |
   | s_store_sk  | 10                 | 1    | 100  | 0         | 4           | 4           |
   | s_store_id  | **5**              | -1   | -1   | -1        | 4           | 4           |
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org