You are viewing a plain text version of this content. The canonical link for it is here.

Posted to notifications@shardingsphere.apache.org by "sandynz (via GitHub)" <gi...@apache.org> on 2023/02/22 03:02:43 UTC

[GitHub] [shardingsphere] sandynz opened a new issue, #24291: Improve table records count calculation in pipeline job

sandynz opened a new issue, #24291:
URL: https://github.com/apache/shardingsphere/issues/24291

   ## Feature Request
   
   ### Is your feature request related to a problem?
   Yes
   
   ### Describe the feature you would like.
   Currently, `InventoryTaskSplitter.getTableRecordsCount` use `SELECT COUNT(*) FROM table` to query records count. It might cost too much time, even more than time out of connection.
   
   A possible solution is to query estimated records count from database.
   
   Some investigation on databases:
   
   #### MySQL investigation
   
   Query from `INFORMATION_SCHEMA.TABLES`. Example:
   
   ```
   mysql> select table_schema, table_name, table_rows from INFORMATION_SCHEMA.TABLES where table_schema='sysbench_ds_1';
   +---------------+------------+------------+
   | table_schema  | table_name | table_rows |
   +---------------+------------+------------+
   | sysbench_ds_1 | sbtest1    |  138247749 |
   +---------------+------------+------------+
   1 row in set (0.02 sec)
   
   mysql> analyze table sysbench_ds_1.sbtest1;
   +-----------------------+---------+----------+----------+
   | Table                 | Op      | Msg_type | Msg_text |
   +-----------------------+---------+----------+----------+
   | sysbench_ds_1.sbtest1 | analyze | status   | OK       |
   +-----------------------+---------+----------+----------+
   1 row in set (0.46 sec)
   
   mysql> select table_schema, table_name, table_rows from INFORMATION_SCHEMA.TABLES where table_schema='sysbench_ds_1';
   +---------------+------------+------------+
   | table_schema  | table_name | table_rows |
   +---------------+------------+------------+
   | sysbench_ds_1 | sbtest1    |  138082248 |
   +---------------+------------+------------+
   1 row in set (0.02 sec)
   
   mysql> explain select count(1) from sysbench_ds_1.sbtest1;
   +----+-------------+---------+------------+-------+---------------+------+---------+------+-----------+----------+-------------+
   | id | select_type | table   | partitions | type  | possible_keys | key  | key_len | ref  | rows      | filtered | Extra       |
   +----+-------------+---------+------------+-------+---------------+------+---------+------+-----------+----------+-------------+
   |  1 | SIMPLE      | sbtest1 | NULL       | index | NULL          | k_1  | 4       | NULL | 138082248 |   100.00 | Using index |
   +----+-------------+---------+------------+-------+---------------+------+---------+------+-----------+----------+-------------+
   1 row in set, 1 warning (0.16 sec)
   
   mysql> select count(1) from sysbench_ds_1.sbtest1;
   +-----------+
   | count(1)  |
   +-----------+
   | 140000000 |
   +-----------+
   1 row in set (36.82 sec)
   ```
   
   `0.02 sec` vs `36.82 sec` on 140 millions records table. The estimated records count has accurateness lost: `(140000000-138082248)/140000000.0*100 ~= 1.37`, it's acceptable.
   
   #### PostgreSQL investigation
   
   Query from `pg_class`. Example:
   ```
   test1=# select count(1) from t_order;
    count
   -------
        6
   (1 row)
   
   test1=# select relname,reltuples::integer from pg_class where relname='t_order';
    relname | reltuples
   ---------+-----------
    t_order |         0
   (1 row)
   
   
   test1=# analyze t_order;
   ANALYZE
   
   test1=# select relname,reltuples::integer from pg_class where relname='t_order';
    relname | reltuples
   ---------+-----------
    t_order |         6
   (1 row)
   ```
   
   TODO: it needs more test on large table.
   
   #### Possible changes
   
   - Add method `Optional<String> buildEstimatedCountSQL` in `PipelineSQLBuilder`, implement it if database support records count estimation. We could support `MySQL`, `PostgreSQL` and `openGauss` for now.
   - If estatimated records count is not supported, then use `buildCountSQL`
   
   Notice:
   - The estimated records count might return `0` from database
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@shardingsphere.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [shardingsphere] azexcy commented on issue #24291: Improve table records count calculation in pipeline job

Posted by "azexcy (via GitHub)" <gi...@apache.org>.

azexcy commented on issue #24291:
URL: https://github.com/apache/shardingsphere/issues/24291#issuecomment-1439925457

   PostgreSQL estimated count maybe greater than actual, and schema needs to be considered
   ```
   postgres=# select relname,reltuples::integer from pg_class where relname='sbtest';
    relname | reltuples
   ---------+-----------
    sbtest  |        -1
    sbtest  | 100000080
   (2 rows)
   
   postgres=# select count(*) from test.sbtest;
      count
   -----------
    100000000
   (1 row)
   
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@shardingsphere.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [shardingsphere] sandynz closed issue #24291: Improve table records count calculation in pipeline job

Posted by "sandynz (via GitHub)" <gi...@apache.org>.

sandynz closed issue #24291: Improve table records count calculation in pipeline job
URL: https://github.com/apache/shardingsphere/issues/24291


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@shardingsphere.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [shardingsphere] sandynz closed issue #24291: Improve table records count calculation in pipeline job

Posted by "sandynz (via GitHub)" <gi...@apache.org>.

sandynz closed issue #24291: Improve table records count calculation in pipeline job
URL: https://github.com/apache/shardingsphere/issues/24291


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@shardingsphere.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org