You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/16 05:09:48 UTC

[GitHub] [spark] linhongliu-db opened a new pull request, #37532: [DRAFT][SPARK-39989][SQL][FollowUp] Improve foldable expression stats estimate

linhongliu-db opened a new pull request, #37532:
URL: https://github.com/apache/spark/pull/37532

   ### What changes were proposed in this pull request?
   This PR improves the foldable expression statistics estimation by providing more accurate min, max, and data length.
   
   
   ### Why are the changes needed?
   Improve the accuracy of the statistics.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   WIP
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] linhongliu-db commented on a diff in pull request #37532: [SPARK-39989][SQL][FollowUp] Improve foldable expression stats estimate for string and binary

Posted by GitBox <gi...@apache.org>.
linhongliu-db commented on code in PR #37532:
URL: https://github.com/apache/spark/pull/37532#discussion_r964292433


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala:
##########
@@ -23,9 +23,22 @@ import scala.math.BigDecimal.RoundingMode
 import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, AttributeMap, EmptyRow, Expression}
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.types.{DecimalType, _}
+import org.apache.spark.unsafe.types.UTF8String
 
 object EstimationUtils {
 
+  /** Returns true iff the we support column statistics on column of the given type. */
+  def supportsType(dataType: DataType): Boolean = dataType match {

Review Comment:
   yes.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala:
##########
@@ -128,6 +149,25 @@ object EstimationUtils {
     if (outputRowCount > 0) outputRowCount * getSizePerRow(attributes, attrStats) else 1
   }
 
+  /**
+   * Get the size in bytes of a value of a constant expression
+   * @param expr
+   * @return the size in bytes
+   */
+  def getConstantLen(expr: Expression): Long = {

Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #37532: [SPARK-39989][SQL][FollowUp] Improve foldable expression stats estimate for string and binary

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #37532:
URL: https://github.com/apache/spark/pull/37532#discussion_r964351789


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala:
##########
@@ -23,9 +23,22 @@ import scala.math.BigDecimal.RoundingMode
 import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, AttributeMap, EmptyRow, Expression}
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.types.{DecimalType, _}
+import org.apache.spark.unsafe.types.UTF8String
 
 object EstimationUtils {
 
+  /** Returns true iff the we support column statistics on column of the given type. */
+  def supportsType(dataType: DataType): Boolean = dataType match {
+    case _: IntegralType => true
+    case _: DecimalType => true
+    case DoubleType | FloatType => true
+    case BooleanType => true
+    case DateType => true
+    case TimestampType => true

Review Comment:
   how about `AnsiInterval`?



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala:
##########
@@ -23,9 +23,22 @@ import scala.math.BigDecimal.RoundingMode
 import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, AttributeMap, EmptyRow, Expression}
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.types.{DecimalType, _}
+import org.apache.spark.unsafe.types.UTF8String
 
 object EstimationUtils {
 
+  /** Returns true iff the we support column statistics on column of the given type. */
+  def supportsType(dataType: DataType): Boolean = dataType match {
+    case _: IntegralType => true
+    case _: DecimalType => true
+    case DoubleType | FloatType => true
+    case BooleanType => true
+    case DateType => true
+    case TimestampType => true

Review Comment:
   how about `AnsiIntervalType`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] closed pull request #37532: [SPARK-39989][SQL][FollowUp] Improve foldable expression stats estimate for string and binary

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #37532: [SPARK-39989][SQL][FollowUp] Improve foldable expression stats estimate for string and binary
URL: https://github.com/apache/spark/pull/37532


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on a diff in pull request #37532: [SPARK-39989][SQL][FollowUp] Improve foldable expression stats estimate for string and binary

Posted by GitBox <gi...@apache.org>.
wangyum commented on code in PR #37532:
URL: https://github.com/apache/spark/pull/37532#discussion_r962105864


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala:
##########
@@ -128,6 +149,25 @@ object EstimationUtils {
     if (outputRowCount > 0) outputRowCount * getSizePerRow(attributes, attrStats) else 1
   }
 
+  /**
+   * Get the size in bytes of a value of a constant expression
+   * @param expr
+   * @return the size in bytes
+   */
+  def getConstantLen(expr: Expression): Long = {

Review Comment:
   Use "value"? Otherwise the expression needs to be evaluated twice.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #37532: [SPARK-39989][SQL][FollowUp] Improve foldable expression stats estimate for string and binary

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #37532:
URL: https://github.com/apache/spark/pull/37532#discussion_r963258833


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala:
##########
@@ -23,9 +23,22 @@ import scala.math.BigDecimal.RoundingMode
 import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, AttributeMap, EmptyRow, Expression}
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.types.{DecimalType, _}
+import org.apache.spark.unsafe.types.UTF8String
 
 object EstimationUtils {
 
+  /** Returns true iff the we support column statistics on column of the given type. */
+  def supportsType(dataType: DataType): Boolean = dataType match {

Review Comment:
   is it only for constants?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] linhongliu-db commented on pull request #37532: [SPARK-39989][SQL][FollowUp] Improve foldable expression stats estimate for string and binary

Posted by GitBox <gi...@apache.org>.
linhongliu-db commented on PR #37532:
URL: https://github.com/apache/spark/pull/37532#issuecomment-1235913827

   cc @cloud-fan @wangyum 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] commented on pull request #37532: [SPARK-39989][SQL][FollowUp] Improve foldable expression stats estimate for string and binary

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #37532:
URL: https://github.com/apache/spark/pull/37532#issuecomment-1355860756

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org