Posted to commits@doris.apache.org by "Jibing-Li (via GitHub)" <gi...@apache.org> on 2023/06/27 01:50:44 UTC

[GitHub] [doris] Jibing-Li opened a new pull request, #21207: [improvement](statistics, multi catalog)Estimate hive table row count based on file size.

Jibing-Li opened a new pull request, #21207:
URL: https://github.com/apache/doris/pull/21207

   
   
   Support estimating the table row count based on file size.
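
A minimal sketch of the file-size-based estimate, assuming the row size is approximated by summing a fixed per-column slot size and the row count is the total size divided by that sum. The class and method names here are hypothetical and the slot sizes are passed in as plain integers; this does not use the real Doris classes. Unlike the diff quoted later in this thread, which returns 1 when the estimated row size is 0, this sketch returns -1 in that case to signal "unknown".

```java
import java.util.List;

// Hypothetical stand-in for the estimate described above; not the Doris implementation.
final class FileSizeRowCountEstimator {

    /**
     * Estimates the row count as totalSizeBytes / estimatedRowSize.
     *
     * @param totalSizeBytes  total size of the table's data files, in bytes
     * @param columnSlotSizes assumed in-memory size of each column, in bytes
     * @return the estimated row count, or -1 when no estimate is possible
     */
    static long estimateRowCount(long totalSizeBytes, List<Integer> columnSlotSizes) {
        if (totalSizeBytes < 0) {
            return -1;
        }
        long estimatedRowSize = 0;
        for (int slotSize : columnSlotSizes) {
            estimatedRowSize += slotSize;
        }
        // Guard against division by zero (for example, an empty schema).
        if (estimatedRowSize <= 0) {
            return -1;
        }
        return totalSizeBytes / estimatedRowSize;
    }

    public static void main(String[] args) {
        // 2800 bytes of data, rows assumed to take 4 + 8 + 16 = 28 bytes each.
        System.out.println(estimateRowCount(2800L, List.of(4, 8, 16)));
    }
}
```

The result is only a rough estimate, since the size of a row on disk need not match the sum of the slot sizes.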
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] Jibing-Li commented on a diff in pull request #21207: [improvement](statistics, multi catalog)Estimate hive table row count based on file size.

Posted by "Jibing-Li (via GitHub)" <gi...@apache.org>.
Jibing-Li commented on code in PR #21207:
URL: https://github.com/apache/doris/pull/21207#discussion_r1250368577


##########
fe/fe-core/src/main/java/org/apache/doris/statistics/util/StatisticsUtil.java:
##########
@@ -461,4 +478,102 @@ public static int getTableHealth(long totalRows, long updatedRows) {
             return (int) (healthCoefficient * 100.0);
         }
     }
+
+    /**
+     * Estimate hive table row count.
+     * First get it from remote table parameters. If not found, estimate it : totalSize/estimatedRowSize
+     * @param table Hive HMSExternalTable to estimate row count.
+     * @return estimated row count
+     */
+    public static long getHiveRowCount(HMSExternalTable table) {
+        Map<String, String> parameters = table.getRemoteTable().getParameters();
+        if (parameters == null) {
+            return -1;
+        }
+        // Table parameters contains row count, simply get and return it.
+        if (parameters.containsKey(NUM_ROWS)) {
+            return Long.parseLong(parameters.get(NUM_ROWS));
+        }
+        if (!parameters.containsKey(TOTAL_SIZE)) {
+            return -1;
+        }
+        // Table parameters doesn't contain row count but contain total size. Estimate row count : totalSize/rowSize
+        long totalSize = Long.parseLong(parameters.get(TOTAL_SIZE));
+        long estimatedRowSize = 0;
+        for (Column column : table.getFullSchema()) {
+            estimatedRowSize += column.getDataType().getSlotSize();
+        }
+        if (estimatedRowSize == 0) {
+            return 1;
+        }
+        return totalSize / estimatedRowSize;
+    }
+
+    /**
+     * Estimate iceberg table row count.
+     * Get the row count by adding all task file recordCount.
+     * @param table Iceberg HMSExternalTable to estimate row count.
+     * @return estimated row count
+     */
+    public static long getIcebergRowCount(HMSExternalTable table) {
+        long rowCount = 0;
+        try {
+            Table icebergTable = HiveMetaStoreClientHelper.getIcebergTable(table);
+            TableScan tableScan = icebergTable.newScan().includeColumnStats();
+            for (FileScanTask task : tableScan.planFiles()) {
+                rowCount += task.file().recordCount();
+            }
+            return rowCount;
+        } catch (Exception e) {
+            LOG.warn("Fail to collect row count for db {} table {}", table.getDbName(), table.getName(), e);
+        }
+        return -1;
+    }
+
+    /**
+     * Estimate hive table row count : totalFileSize/estimatedRowSize
+     * @param table Hive HMSExternalTable to estimate row count.
+     * @return estimated row count
+     */
+    public static long getRowCountFromFileList(HMSExternalTable table) {

Review Comment:
   We need to get at least all the partition values. Then we can randomly choose some of the partitions as sample partitions to estimate the row count, assuming all partitions contain an identical number of rows. In this case we don't need to access all data files in all partitions; we only need to access the sample partitions.
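
The sampling idea in this comment could be sketched roughly as follows. This is an illustration under stated assumptions, not the code merged in the PR: the class name, the `partitionSizeBytes` callback (standing in for listing the data files of one partition), and the sample size are all hypothetical. It shuffles the partition names, sums the file sizes of the first few, scales that sum to the full partition count, and divides by the estimated row size.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.ToLongFunction;

// Hypothetical sketch of partition sampling; not the Doris implementation.
final class SampledRowCountEstimator {

    /**
     * @param partitionNames     names of all partitions of the table
     * @param sampleSize         maximum number of partitions to inspect
     * @param partitionSizeBytes returns the total file size of one partition (assumed callback)
     * @param estimatedRowSize   assumed size of one row, in bytes
     * @param random             source of randomness for choosing the sample
     * @return the estimated row count, or -1 when no estimate is possible
     */
    static long estimate(List<String> partitionNames, int sampleSize,
                         ToLongFunction<String> partitionSizeBytes,
                         long estimatedRowSize, Random random) {
        if (partitionNames.isEmpty() || sampleSize <= 0 || estimatedRowSize <= 0) {
            return -1;
        }
        // Shuffle a copy so the caller's list is left untouched.
        List<String> shuffled = new ArrayList<>(partitionNames);
        Collections.shuffle(shuffled, random);
        List<String> sample = shuffled.subList(0, Math.min(sampleSize, shuffled.size()));

        // Only the sampled partitions are accessed.
        long sampledBytes = 0;
        for (String partition : sample) {
            sampledBytes += partitionSizeBytes.applyAsLong(partition);
        }
        // Scale up, assuming every partition holds about the same amount of data.
        // Integer arithmetic is used here; overflow for very large tables is ignored.
        long estimatedTotalBytes = sampledBytes * partitionNames.size() / sample.size();
        return estimatedTotalBytes / estimatedRowSize;
    }
}
```

The estimate is only as good as the assumption that partitions are of similar size; skewed partitions would need a larger sample or a different strategy.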





[GitHub] [doris] hello-stephen commented on pull request #21207: [improvement](statistics, multi catalog)Estimate hive table row count based on file size.

Posted by "hello-stephen (via GitHub)" <gi...@apache.org>.
hello-stephen commented on PR #21207:
URL: https://github.com/apache/doris/pull/21207#issuecomment-1616697627

   TeamCity pipeline, clickbench performance test result:
    the sum of best hot time: 42.11 seconds
    stream load tsv:          459 seconds loaded 74807831229 Bytes, about 155 MB/s
    stream load json:         19 seconds loaded 2358488459 Bytes, about 118 MB/s
    stream load orc:          57 seconds loaded 1101869774 Bytes, about 18 MB/s
    stream load parquet:          30 seconds loaded 861443392 Bytes, about 27 MB/s
    insert into select:          68.7 seconds inserted 10000000 Rows, about 145K ops/s
    https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20230702151100_clickbench_pr_171025.html




[GitHub] [doris] Jibing-Li commented on pull request #21207: [improvement](statistics, multi catalog)Estimate hive table row count based on file size.

Posted by "Jibing-Li (via GitHub)" <gi...@apache.org>.
Jibing-Li commented on PR #21207:
URL: https://github.com/apache/doris/pull/21207#issuecomment-1612383584

   run buildall




[GitHub] [doris] github-actions[bot] commented on pull request #21207: [improvement](statistics, multi catalog)Estimate hive table row count based on file size.

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #21207:
URL: https://github.com/apache/doris/pull/21207#issuecomment-1619894012

   PR approved by anyone and no changes requested.




[GitHub] [doris] github-actions[bot] commented on pull request #21207: [improvement](statistics, multi catalog)Estimate hive table row count based on file size.

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #21207:
URL: https://github.com/apache/doris/pull/21207#issuecomment-1619893932

   PR approved by at least one committer and no changes requested.




[GitHub] [doris] Jibing-Li commented on pull request #21207: [improvement](statistics, multi catalog)Estimate hive table row count based on file size.

Posted by "Jibing-Li (via GitHub)" <gi...@apache.org>.
Jibing-Li commented on PR #21207:
URL: https://github.com/apache/doris/pull/21207#issuecomment-1616686094

   run buildall




[GitHub] [doris] hello-stephen commented on pull request #21207: [improvement](statistics, multi catalog)Estimate hive table row count based on file size.

Posted by "hello-stephen (via GitHub)" <gi...@apache.org>.
hello-stephen commented on PR #21207:
URL: https://github.com/apache/doris/pull/21207#issuecomment-1614853857

   TeamCity pipeline, clickbench performance test result:
    the sum of best hot time: 39.01 seconds
    stream load tsv:          457 seconds loaded 74807831229 Bytes, about 156 MB/s
    stream load json:         20 seconds loaded 2358488459 Bytes, about 112 MB/s
    stream load orc:          57 seconds loaded 1101869774 Bytes, about 18 MB/s
    stream load parquet:          29 seconds loaded 861443392 Bytes, about 28 MB/s
    insert into select:          67.5 seconds inserted 10000000 Rows, about 148K ops/s
    https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20230630155348_clickbench_pr_170675.html




[GitHub] [doris] Jibing-Li commented on a diff in pull request #21207: [improvement](statistics, multi catalog)Estimate hive table row count based on file size.

Posted by "Jibing-Li (via GitHub)" <gi...@apache.org>.
Jibing-Li commented on code in PR #21207:
URL: https://github.com/apache/doris/pull/21207#discussion_r1250368577


##########
fe/fe-core/src/main/java/org/apache/doris/statistics/util/StatisticsUtil.java:
##########
@@ -461,4 +478,102 @@ public static int getTableHealth(long totalRows, long updatedRows) {
             return (int) (healthCoefficient * 100.0);
         }
     }
+
+    /**
+     * Estimate hive table row count.
+     * First get it from remote table parameters. If not found, estimate it : totalSize/estimatedRowSize
+     * @param table Hive HMSExternalTable to estimate row count.
+     * @return estimated row count
+     */
+    public static long getHiveRowCount(HMSExternalTable table) {
+        Map<String, String> parameters = table.getRemoteTable().getParameters();
+        if (parameters == null) {
+            return -1;
+        }
+        // Table parameters contains row count, simply get and return it.
+        if (parameters.containsKey(NUM_ROWS)) {
+            return Long.parseLong(parameters.get(NUM_ROWS));
+        }
+        if (!parameters.containsKey(TOTAL_SIZE)) {
+            return -1;
+        }
+        // Table parameters doesn't contain row count but contain total size. Estimate row count : totalSize/rowSize
+        long totalSize = Long.parseLong(parameters.get(TOTAL_SIZE));
+        long estimatedRowSize = 0;
+        for (Column column : table.getFullSchema()) {
+            estimatedRowSize += column.getDataType().getSlotSize();
+        }
+        if (estimatedRowSize == 0) {
+            return 1;
+        }
+        return totalSize / estimatedRowSize;
+    }
+
+    /**
+     * Estimate iceberg table row count.
+     * Get the row count by adding all task file recordCount.
+     * @param table Iceberg HMSExternalTable to estimate row count.
+     * @return estimated row count
+     */
+    public static long getIcebergRowCount(HMSExternalTable table) {
+        long rowCount = 0;
+        try {
+            Table icebergTable = HiveMetaStoreClientHelper.getIcebergTable(table);
+            TableScan tableScan = icebergTable.newScan().includeColumnStats();
+            for (FileScanTask task : tableScan.planFiles()) {
+                rowCount += task.file().recordCount();
+            }
+            return rowCount;
+        } catch (Exception e) {
+            LOG.warn("Fail to collect row count for db {} table {}", table.getDbName(), table.getName(), e);
+        }
+        return -1;
+    }
+
+    /**
+     * Estimate hive table row count : totalFileSize/estimatedRowSize
+     * @param table Hive HMSExternalTable to estimate row count.
+     * @return estimated row count
+     */
+    public static long getRowCountFromFileList(HMSExternalTable table) {

Review Comment:
   We need to get at least all the partition values. Then we can randomly choose some of the partitions as sample partitions to estimate the row count, assuming all partitions contain an identical number of rows. In this case we don't need to access all data files in all partitions.





[GitHub] [doris] hello-stephen commented on pull request #21207: [improvement](statistics, multi catalog)Estimate hive table row count based on file size.

Posted by "hello-stephen (via GitHub)" <gi...@apache.org>.
hello-stephen commented on PR #21207:
URL: https://github.com/apache/doris/pull/21207#issuecomment-1614732225

   TeamCity pipeline, clickbench performance test result:
    the sum of best hot time: 37.57 seconds
    stream load tsv:          456 seconds loaded 74807831229 Bytes, about 156 MB/s
    stream load json:         20 seconds loaded 2358488459 Bytes, about 112 MB/s
    stream load orc:          57 seconds loaded 1101869774 Bytes, about 18 MB/s
    stream load parquet:          29 seconds loaded 861443392 Bytes, about 28 MB/s
    insert into select:          70.4 seconds inserted 10000000 Rows, about 142K ops/s
    https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20230630142533_clickbench_pr_170605.html




[GitHub] [doris] morningman merged pull request #21207: [improvement](statistics, multi catalog)Estimate hive table row count based on file size.

Posted by "morningman (via GitHub)" <gi...@apache.org>.
morningman merged PR #21207:
URL: https://github.com/apache/doris/pull/21207




[GitHub] [doris] Jibing-Li commented on pull request #21207: [improvement](statistics, multi catalog)Estimate hive table row count based on file size.

Posted by "Jibing-Li (via GitHub)" <gi...@apache.org>.
Jibing-Li commented on PR #21207:
URL: https://github.com/apache/doris/pull/21207#issuecomment-1614751373

   run buildall




[GitHub] [doris] morningman commented on a diff in pull request #21207: [improvement](statistics, multi catalog)Estimate hive table row count based on file size.

Posted by "morningman (via GitHub)" <gi...@apache.org>.
morningman commented on code in PR #21207:
URL: https://github.com/apache/doris/pull/21207#discussion_r1250282397


##########
fe/fe-core/src/main/java/org/apache/doris/statistics/util/StatisticsUtil.java:
##########
@@ -461,4 +478,102 @@ public static int getTableHealth(long totalRows, long updatedRows) {
             return (int) (healthCoefficient * 100.0);
         }
     }
+
+    /**
+     * Estimate hive table row count.
+     * First get it from remote table parameters. If not found, estimate it : totalSize/estimatedRowSize
+     * @param table Hive HMSExternalTable to estimate row count.
+     * @return estimated row count
+     */
+    public static long getHiveRowCount(HMSExternalTable table) {
+        Map<String, String> parameters = table.getRemoteTable().getParameters();
+        if (parameters == null) {
+            return -1;
+        }
+        // Table parameters contains row count, simply get and return it.
+        if (parameters.containsKey(NUM_ROWS)) {
+            return Long.parseLong(parameters.get(NUM_ROWS));
+        }
+        if (!parameters.containsKey(TOTAL_SIZE)) {
+            return -1;
+        }
+        // Table parameters doesn't contain row count but contain total size. Estimate row count : totalSize/rowSize
+        long totalSize = Long.parseLong(parameters.get(TOTAL_SIZE));
+        long estimatedRowSize = 0;
+        for (Column column : table.getFullSchema()) {
+            estimatedRowSize += column.getDataType().getSlotSize();
+        }
+        if (estimatedRowSize == 0) {
+            return 1;
+        }
+        return totalSize / estimatedRowSize;
+    }
+
+    /**
+     * Estimate iceberg table row count.
+     * Get the row count by adding all task file recordCount.
+     * @param table Iceberg HMSExternalTable to estimate row count.
+     * @return estimated row count
+     */
+    public static long getIcebergRowCount(HMSExternalTable table) {
+        long rowCount = 0;
+        try {
+            Table icebergTable = HiveMetaStoreClientHelper.getIcebergTable(table);
+            TableScan tableScan = icebergTable.newScan().includeColumnStats();
+            for (FileScanTask task : tableScan.planFiles()) {
+                rowCount += task.file().recordCount();
+            }
+            return rowCount;
+        } catch (Exception e) {
+            LOG.warn("Fail to collect row count for db {} table {}", table.getDbName(), table.getName(), e);
+        }
+        return -1;
+    }
+
+    /**
+     * Estimate hive table row count : totalFileSize/estimatedRowSize
+     * @param table Hive HMSExternalTable to estimate row count.
+     * @return estimated row count
+     */
+    public static long getRowCountFromFileList(HMSExternalTable table) {

Review Comment:
   Do we need to get all partitions for calculating row count?





[GitHub] [doris] Jibing-Li commented on a diff in pull request #21207: [improvement](statistics, multi catalog)Estimate hive table row count based on file size.

Posted by "Jibing-Li (via GitHub)" <gi...@apache.org>.
Jibing-Li commented on code in PR #21207:
URL: https://github.com/apache/doris/pull/21207#discussion_r1250368577


##########
fe/fe-core/src/main/java/org/apache/doris/statistics/util/StatisticsUtil.java:
##########
@@ -461,4 +478,102 @@ public static int getTableHealth(long totalRows, long updatedRows) {
             return (int) (healthCoefficient * 100.0);
         }
     }
+
+    /**
+     * Estimate hive table row count.
+     * First get it from remote table parameters. If not found, estimate it : totalSize/estimatedRowSize
+     * @param table Hive HMSExternalTable to estimate row count.
+     * @return estimated row count
+     */
+    public static long getHiveRowCount(HMSExternalTable table) {
+        Map<String, String> parameters = table.getRemoteTable().getParameters();
+        if (parameters == null) {
+            return -1;
+        }
+        // Table parameters contains row count, simply get and return it.
+        if (parameters.containsKey(NUM_ROWS)) {
+            return Long.parseLong(parameters.get(NUM_ROWS));
+        }
+        if (!parameters.containsKey(TOTAL_SIZE)) {
+            return -1;
+        }
+        // Table parameters doesn't contain row count but contain total size. Estimate row count : totalSize/rowSize
+        long totalSize = Long.parseLong(parameters.get(TOTAL_SIZE));
+        long estimatedRowSize = 0;
+        for (Column column : table.getFullSchema()) {
+            estimatedRowSize += column.getDataType().getSlotSize();
+        }
+        if (estimatedRowSize == 0) {
+            return 1;
+        }
+        return totalSize / estimatedRowSize;
+    }
+
+    /**
+     * Estimate iceberg table row count.
+     * Get the row count by adding all task file recordCount.
+     * @param table Iceberg HMSExternalTable to estimate row count.
+     * @return estimated row count
+     */
+    public static long getIcebergRowCount(HMSExternalTable table) {
+        long rowCount = 0;
+        try {
+            Table icebergTable = HiveMetaStoreClientHelper.getIcebergTable(table);
+            TableScan tableScan = icebergTable.newScan().includeColumnStats();
+            for (FileScanTask task : tableScan.planFiles()) {
+                rowCount += task.file().recordCount();
+            }
+            return rowCount;
+        } catch (Exception e) {
+            LOG.warn("Fail to collect row count for db {} table {}", table.getDbName(), table.getName(), e);
+        }
+        return -1;
+    }
+
+    /**
+     * Estimate hive table row count : totalFileSize/estimatedRowSize
+     * @param table Hive HMSExternalTable to estimate row count.
+     * @return estimated row count
+     */
+    public static long getRowCountFromFileList(HMSExternalTable table) {

Review Comment:
   We need to get at least all the partition names. Then we can randomly choose some of the partitions as sample partitions to estimate the row count, assuming all partitions contain an identical number of rows.





[GitHub] [doris] hello-stephen commented on pull request #21207: [improvement](statistics, multi catalog)Estimate hive table row count based on file size.

Posted by "hello-stephen (via GitHub)" <gi...@apache.org>.
hello-stephen commented on PR #21207:
URL: https://github.com/apache/doris/pull/21207#issuecomment-1618173783

   TeamCity pipeline, clickbench performance test result:
    the sum of best hot time: 37.88 seconds
    stream load tsv:          457 seconds loaded 74807831229 Bytes, about 156 MB/s
    stream load json:         20 seconds loaded 2358488459 Bytes, about 112 MB/s
    stream load orc:          57 seconds loaded 1101869774 Bytes, about 18 MB/s
    stream load parquet:          28 seconds loaded 861443392 Bytes, about 29 MB/s
    insert into select:          69.1 seconds inserted 10000000 Rows, about 144K ops/s
    https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20230703123155_clickbench_pr_171540.html




[GitHub] [doris] Jibing-Li commented on a diff in pull request #21207: [improvement](statistics, multi catalog)Estimate hive table row count based on file size.

Posted by "Jibing-Li (via GitHub)" <gi...@apache.org>.
Jibing-Li commented on code in PR #21207:
URL: https://github.com/apache/doris/pull/21207#discussion_r1246007116


##########
fe/fe-core/src/main/java/org/apache/doris/catalog/external/HMSExternalTable.java:
##########
@@ -433,5 +467,83 @@ private void initPartitionColumns(List<Column> schema) {
         LOG.debug("get {} partition columns for table: {}", partitionColumns.size(), name);
     }
 
+    private long getHiveRowCount() {
+        Map<String, String> parameters = remoteTable.getParameters();
+        if (parameters == null) {
+            return -1;
+        }
+        if (parameters.containsKey(NUM_ROWS)) {
+            return Long.parseLong(parameters.get(NUM_ROWS));
+        }
+        if (!parameters.containsKey(TOTAL_SIZE)) {
+            return -1;
+        }
+        long totalSize = Long.parseLong(parameters.get(TOTAL_SIZE));
+        long estimatedRowSize = 0;
+        for (Column column : getFullSchema()) {
+            estimatedRowSize += column.getDataType().getSlotSize();
+        }
+        if (estimatedRowSize == 0) {
+            return 1;
+        }
+        return totalSize / estimatedRowSize;
+    }
+
+    private long getIcebergRowCount() {
+        long rowCount = 0;
+        try {
+            Table icebergTable = HiveMetaStoreClientHelper.getIcebergTable(this);
+            TableScan tableScan = icebergTable.newScan().includeColumnStats();
+            for (FileScanTask task : tableScan.planFiles()) {
+                rowCount += task.file().recordCount();
+            }
+            return rowCount;
+        } catch (Exception e) {
+            LOG.warn(String.format("Fail to collect row count for db %s table %s", dbName, name), e);
+        }
+        return -1;
+    }
+
+    private long getRowCountFromFileList() {

Review Comment:
   Use a cache to avoid calling this function every time; call it only once, the first time the row count is needed.
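
One way to read "call it only once" is a lazily computed, memoized value. The sketch below is an assumption about the shape such a cache might take, not the caching the PR actually added: `CachedRowCount` and its `loader` are hypothetical names, and the loader stands in for the expensive file-listing call.

```java
import java.util.function.LongSupplier;

// Hypothetical memoizing wrapper; not the Doris implementation.
final class CachedRowCount {
    private final LongSupplier loader;
    private boolean loaded = false;
    private long cached;

    CachedRowCount(LongSupplier loader) {
        this.loader = loader;
    }

    /** Runs the loader on the first call only; later calls return the stored value. */
    synchronized long get() {
        if (!loaded) {
            cached = loader.getAsLong();
            loaded = true;
        }
        return cached;
    }
}
```

A cache like this never refreshes, so a real implementation would also need some invalidation rule for when the underlying table changes.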





[GitHub] [doris] Jibing-Li commented on pull request #21207: [improvement](statistics, multi catalog)Estimate hive table row count based on file size.

Posted by "Jibing-Li (via GitHub)" <gi...@apache.org>.
Jibing-Li commented on PR #21207:
URL: https://github.com/apache/doris/pull/21207#issuecomment-1618110035

   run buildall




[GitHub] [doris] morningman commented on a diff in pull request #21207: [improvement](statistics, multi catalog)Estimate hive table row count based on file size.

Posted by "morningman (via GitHub)" <gi...@apache.org>.
morningman commented on code in PR #21207:
URL: https://github.com/apache/doris/pull/21207#discussion_r1243407638


##########
fe/fe-core/src/main/java/org/apache/doris/catalog/external/HMSExternalTable.java:
##########
@@ -348,6 +375,11 @@ public Partition getPartition(List<String> partitionValues) {
         return client.getPartition(dbName, name, partitionValues);
     }
 
+    public List<Partition> getPartitionsByFilter(String filter) {

Review Comment:
   unused?



##########
fe/fe-core/src/main/java/org/apache/doris/catalog/external/HMSExternalTable.java:
##########
@@ -433,5 +467,83 @@ private void initPartitionColumns(List<Column> schema) {
         LOG.debug("get {} partition columns for table: {}", partitionColumns.size(), name);
     }
 
+    private long getHiveRowCount() {
+        Map<String, String> parameters = remoteTable.getParameters();
+        if (parameters == null) {
+            return -1;
+        }
+        if (parameters.containsKey(NUM_ROWS)) {
+            return Long.parseLong(parameters.get(NUM_ROWS));
+        }
+        if (!parameters.containsKey(TOTAL_SIZE)) {
+            return -1;
+        }
+        long totalSize = Long.parseLong(parameters.get(TOTAL_SIZE));
+        long estimatedRowSize = 0;
+        for (Column column : getFullSchema()) {
+            estimatedRowSize += column.getDataType().getSlotSize();
+        }
+        if (estimatedRowSize == 0) {
+            return 1;
+        }
+        return totalSize / estimatedRowSize;
+    }
+
+    private long getIcebergRowCount() {
+        long rowCount = 0;
+        try {
+            Table icebergTable = HiveMetaStoreClientHelper.getIcebergTable(this);
+            TableScan tableScan = icebergTable.newScan().includeColumnStats();
+            for (FileScanTask task : tableScan.planFiles()) {
+                rowCount += task.file().recordCount();
+            }
+            return rowCount;
+        } catch (Exception e) {
+            LOG.warn(String.format("Fail to collect row count for db %s table %s", dbName, name), e);
+        }
+        return -1;
+    }
+
+    private long getRowCountFromFileList() {

Review Comment:
   This method is costly.



##########
fe/fe-core/src/main/java/org/apache/doris/catalog/external/HMSExternalTable.java:
##########
@@ -433,5 +467,83 @@ private void initPartitionColumns(List<Column> schema) {
         LOG.debug("get {} partition columns for table: {}", partitionColumns.size(), name);
     }
 
+    private long getHiveRowCount() {
+        Map<String, String> parameters = remoteTable.getParameters();
+        if (parameters == null) {
+            return -1;
+        }
+        if (parameters.containsKey(NUM_ROWS)) {
+            return Long.parseLong(parameters.get(NUM_ROWS));
+        }
+        if (!parameters.containsKey(TOTAL_SIZE)) {
+            return -1;
+        }
+        long totalSize = Long.parseLong(parameters.get(TOTAL_SIZE));
+        long estimatedRowSize = 0;
+        for (Column column : getFullSchema()) {
+            estimatedRowSize += column.getDataType().getSlotSize();
+        }
+        if (estimatedRowSize == 0) {
+            return 1;
+        }
+        return totalSize / estimatedRowSize;
+    }
+
+    private long getIcebergRowCount() {

Review Comment:
   2 issues:
   1. I think this method should be moved to a separate class, since it is not specific to `HMSExternalTable`. How about creating a new Util class and moving all the `getRowCountxxx` methods into it?
   2. This calls `icebergTable.newScan()`, so each query ends up calling `icebergTable.newScan()` twice: once here and once during planning, which is costly.
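
   For point 1, a minimal sketch of what such a utility could look like. The class name `ExternalRowCountUtil`, the method name, and the parameter key values `numRows` and `totalSize` are assumptions for illustration, not code from this PR; the caller is assumed to pass in the per-row size it computed from the schema.

   ```java
   import java.util.Map;

   // Hypothetical helper class; the name and signature are illustrative only.
   final class ExternalRowCountUtil {
       // Assumed values of the NUM_ROWS and TOTAL_SIZE keys used in the diff.
       static final String NUM_ROWS = "numRows";
       static final String TOTAL_SIZE = "totalSize";

       private ExternalRowCountUtil() {
       }

       /**
        * Returns the row count recorded in the table parameters, or
        * totalSize / estimatedRowSize when only the total size is known.
        * Returns -1 when neither value is available.
        */
       static long estimateRowCount(Map<String, String> parameters, long estimatedRowSize) {
           if (parameters == null) {
               return -1;
           }
           String numRows = parameters.get(NUM_ROWS);
           if (numRows != null) {
               return Long.parseLong(numRows);
           }
           String totalSize = parameters.get(TOTAL_SIZE);
           if (totalSize == null) {
               return -1;
           }
           if (estimatedRowSize <= 0) {
               // Mirrors the diff: avoid dividing by zero when the schema has no sized columns.
               return 1;
           }
           return Long.parseLong(totalSize) / estimatedRowSize;
       }
   }
   ```

   Keeping the method static and free of `HMSExternalTable` state would let other external table types reuse it.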



##########
fe/fe-core/src/main/java/org/apache/doris/catalog/external/HMSExternalTable.java:
##########
@@ -433,5 +467,83 @@ private void initPartitionColumns(List<Column> schema) {
         LOG.debug("get {} partition columns for table: {}", partitionColumns.size(), name);
     }
 
+    private long getHiveRowCount() {
+        Map<String, String> parameters = remoteTable.getParameters();
+        if (parameters == null) {
+            return -1;
+        }
+        if (parameters.containsKey(NUM_ROWS)) {
+            return Long.parseLong(parameters.get(NUM_ROWS));
+        }
+        if (!parameters.containsKey(TOTAL_SIZE)) {
+            return -1;
+        }
+        long totalSize = Long.parseLong(parameters.get(TOTAL_SIZE));
+        long estimatedRowSize = 0;
+        for (Column column : getFullSchema()) {
+            estimatedRowSize += column.getDataType().getSlotSize();
+        }
+        if (estimatedRowSize == 0) {
+            return 1;
+        }
+        return totalSize / estimatedRowSize;
+    }
+
+    private long getIcebergRowCount() {
+        long rowCount = 0;
+        try {
+            Table icebergTable = HiveMetaStoreClientHelper.getIcebergTable(this);
+            TableScan tableScan = icebergTable.newScan().includeColumnStats();
+            for (FileScanTask task : tableScan.planFiles()) {
+                rowCount += task.file().recordCount();
+            }
+            return rowCount;
+        } catch (Exception e) {
+            LOG.warn(String.format("Fail to collect row count for db %s table %s", dbName, name), e);

Review Comment:
   Use `{}` placeholders instead of `String.format` to keep the code simple.



##########
fe/fe-core/src/main/java/org/apache/doris/catalog/external/HMSExternalTable.java:
##########
@@ -395,7 +427,9 @@ public long estimatedRowCount() {
             Optional<TableStatistic> tableStatistics = Env.getCurrentEnv().getStatisticsCache().getTableStatistics(
                     catalog.getId(), catalog.getDbOrAnalysisException(dbName).getId(), id);
             if (tableStatistics.isPresent()) {
-                return tableStatistics.get().rowCount;
+                long rowCount = tableStatistics.get().rowCount;
+                LOG.info(String.format("Estimated row count for db %s table %s is %d.", dbName, name, rowCount));

Review Comment:
   ```suggestion
                   LOG.debug("Estimated row count for db {} table {} is {}.", dbName, name, rowCount);
   ```





[GitHub] [doris] Jibing-Li commented on pull request #21207: [improvement](statistics, multi catalog)Estimate hive table row count based on file size.

Posted by "Jibing-Li (via GitHub)" <gi...@apache.org>.
Jibing-Li commented on PR #21207:
URL: https://github.com/apache/doris/pull/21207#issuecomment-1614590090

   run buildall

