Posted to commits@doris.apache.org by "xzj7019 (via GitHub)" <gi...@apache.org> on 2023/06/12 08:21:54 UTC

[GitHub] [doris] xzj7019 commented on a diff in pull request #20642: [tpcds](nereids) estimate distribution cost by byte size instead of row count

xzj7019 commented on code in PR #20642:
URL: https://github.com/apache/doris/pull/20642#discussion_r1226268711


##########
fe/fe-core/src/main/java/org/apache/doris/nereids/cost/CostModelV1.java:
##########
@@ -191,7 +191,7 @@ public Cost visitPhysicalDistribute(
             return CostV1.of(
                     0,
                     0,
-                    childStatistics.getRowCount() * Math.pow(beNumber, 0.5));

Review Comment:
   In cost model v1, row count is the unified metric for measuring cost, so it would be better to keep a row-count-based cost here as well. If we need to distinguish the distribute cost for different cases, we can multiply by a factor that takes dataSize as its input and has a minimum value of 1.



##########
fe/fe-core/src/main/java/org/apache/doris/nereids/cost/CostModelV1.java:
##########
@@ -161,27 +161,27 @@ public Cost visitPhysicalPartitionTopN(PhysicalPartitionTopN<? extends Plan> par
     @Override
     public Cost visitPhysicalDistribute(
             PhysicalDistribute<? extends Plan> distribute, PlanContext context) {
+        int kBytes = 1024;
         Statistics childStatistics = context.getChildStatistics(0);
         DistributionSpec spec = distribute.getDistributionSpec();
+        int beNumber = ConnectContext.get().getEnv().getClusterInfo().getBackendsNumber(true);
+        beNumber = Math.max(1, beNumber);
+        double dataSize = childStatistics.computeSize() / kBytes; // in K bytes
         // shuffle
         if (spec instanceof DistributionSpecHash) {
             return CostV1.of(
                     0,
                     0,
-                    childStatistics.getRowCount());

Review Comment:
   In cost model v1, row count is the unified metric for measuring cost, so it would be better to keep a row-count-based cost here as well. If we need to distinguish the distribute cost for different cases, we can multiply by a factor that takes dataSize as its input and has a minimum value of 1.
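
   A minimal sketch of what such a factor could look like. The class name, the method names, and the formula itself (average row width in KB, floored at 1) are assumptions made for illustration; none of them come from CostModelV1, and the actual factor may well need a different shape.

```java
// Sketch only: the names and the factor formula are illustrative assumptions,
// not code taken from CostModelV1.
final class DistributeCostSketch {
    private DistributeCostSketch() {}

    /** Hypothetical factor: average row width in KB, never smaller than 1. */
    static double sizeFactor(double rowCount, double dataSizeKb) {
        if (rowCount <= 0) {
            return 1.0; // no rows: fall back to the neutral factor
        }
        return Math.max(1.0, dataSizeKb / rowCount);
    }

    /** Row-count-based distribute cost, scaled by the size factor. */
    static double distributeCost(double rowCount, double dataSizeKb) {
        return rowCount * sizeFactor(rowCount, dataSizeKb);
    }

    public static void main(String[] args) {
        // Narrow rows (0.5 KB each): the factor stays at 1, cost equals row count.
        System.out.println(distributeCost(1000, 500));
        // Wide rows (4 KB each): the factor is 4, cost is four times the row count.
        System.out.println(distributeCost(1000, 4000));
    }
}
```

   With this shape the cost stays in row-count units, and dataSize only raises it when rows are wider than 1 KB on average.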



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

