Posted to reviews@spark.apache.org by "davintjong-db (via GitHub)" <gi...@apache.org> on 2023/12/07 00:34:53 UTC

[PR] [WIP][SPARK-46294][SQL] Clean up semantics of init vs zero value [spark]

davintjong-db opened a new pull request, #44222:
URL: https://github.com/apache/spark/pull/44222

   
   ### What changes were proposed in this pull request?
   
   This PR cleans up the semantics of the init and zero values as follows. This also helps define what an "invalid" metric is.
   
   initValue is the starting value for a SQLMetric. If a metric has a value equal to its initValue, it should be filtered out before aggregating with SQLMetrics.stringValue().
    
   zeroValue defines the lowest value considered valid. If a SQLMetric is invalid, it is set to zeroValue upon receiving any updates, and it also reports zeroValue as its value to avoid exposing it to the user programmatically (a concern previously addressed in [SPARK-41442](https://issues.apache.org/jira/browse/SPARK-41442)).
   
   For many SQLMetrics, we use initValue = -1 and zeroValue = 0 to indicate that the metric is by default invalid. At the end of a task, we will update the metric, making it valid, and the invalid metrics will be filtered out when calculating min, max, etc., as a workaround for [SPARK-11013](https://issues.apache.org/jira/browse/SPARK-11013).
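   
   To make the intended semantics concrete, here is a minimal, self-contained Scala sketch of the behavior described above. The class and method names are illustrative only; this is not the actual SQLMetric implementation, and `isValid` here is one plausible reading of "zeroValue defines the lowest value considered valid":
   
   ```scala
   // Illustrative sketch of the initValue/zeroValue semantics described above.
   class SketchMetric(val initValue: Long = 0L, val zeroValue: Long = 0L) {
     private var _value = initValue
   
     // Assumed reading: any value below zeroValue (e.g. the -1 sentinel) is invalid.
     def isValid: Boolean = _value >= zeroValue
   
     def add(v: Long): Unit = {
       if (!isValid) _value = zeroValue // first update snaps an invalid metric to zeroValue
       _value += v
     }
   
     // Report zeroValue rather than the initValue sentinel, so users never see -1.
     def value: Long = if (isValid) _value else zeroValue
   }
   
   object SketchMetricDemo {
     def main(args: Array[String]): Unit = {
       val m = new SketchMetric(initValue = -1L, zeroValue = 0L)
       println(m.value) // 0: an invalid metric reports zeroValue, not -1
       m.add(5L)
       println(m.value) // 5: valid after its first update
     }
   }
   ```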
   
   
   ### Why are the changes needed?
   
   The semantics of initValue and _zeroValue in SQLMetrics are confusing, since they effectively mean the same thing. Redefining them as described above is clearer, especially in terms of defining what an "invalid" metric is.
   
   ### Does this PR introduce _any_ user-facing change?
   No. This shouldn't change any behavior.
   
   ### How was this patch tested?
   Existing tests.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46294][SQL] Clean up semantics of init vs zero value [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44222:
URL: https://github.com/apache/spark/pull/44222#discussion_r1426174905


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala:
##########
@@ -37,36 +37,47 @@ import org.apache.spark.util.AccumulatorContext.internOption
  * the executor side are automatically propagated and shown in the SQL UI through metrics. Updates
  * on the driver side must be explicitly posted using [[SQLMetrics.postDriverMetricUpdates()]].
  */
-class SQLMetric(val metricType: String, initValue: Long = 0L) extends AccumulatorV2[Long, Long] {
-  // This is a workaround for SPARK-11013.
-  // We may use -1 as initial value of the accumulator, if the accumulator is valid, we will
-  // update it at the end of task and the value will be at least 0. Then we can filter out the -1
-  // values before calculate max, min, etc.
-  private[this] var _value = initValue
-  private var _zeroValue = initValue
+class SQLMetric(val metricType: String,
+                initValue: Long = 0L,
+                zeroValue: Long = 0L) extends AccumulatorV2[Long, Long] {
+  // initValue defines the initial value of the metric. zeroValue defines the lowest value
+  // considered valid. If a SQLMetric is invalid, it is set to zeroValue upon receiving any
+  // updates, and it also reports zeroValue as its value to avoid exposing it to the user
+  // programmatically.
+  //
+  // For many SQLMetrics, we use initValue = -1 and zeroValue = 0 to indicate that the metric is
+  // by default invalid. At the end of a task, we will update the metric making it valid, and the
+  // invalid metrics will be filtered out when calculating min, max, etc. as a workaround for
+  // SPARK-11013.
+  private var _value = initValue
 
   override def copy(): SQLMetric = {
-    val newAcc = new SQLMetric(metricType, _value)
-    newAcc._zeroValue = initValue
+    val newAcc = new SQLMetric(metricType, initValue, zeroValue)
+    newAcc._value = _value
     newAcc
   }
 
-  override def reset(): Unit = _value = _zeroValue
+  override def reset(): Unit = _value = initValue
 
   override def merge(other: AccumulatorV2[Long, Long]): Unit = other match {
     case o: SQLMetric =>
-      if (o.value > 0) {
-        if (_value < 0) _value = 0
+      if (o.isValid) {
+        if (!isValid) _value = zeroValue
         _value += o.value
       }
     case _ => throw QueryExecutionErrors.cannotMergeClassWithOtherClassError(
       this.getClass.getName, other.getClass.getName)
   }
 
-  override def isZero: Boolean = _value == _zeroValue
+  // This is used to filter out metrics. Metrics with value equal to initValue should
+  // be filtered out, since they are either invalid or safe to filter without changing
+  // the aggregation defined in [[SQLMetrics.stringValue]].
+  override def isZero: Boolean = _value == initValue

Review Comment:
   this is a bit tricky as `isZero` is not true when we actually have the zero value...
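   
   To make the subtlety concrete (hypothetical values following the semantics in the PR description, not code from this PR):
   
   ```scala
   // With initValue = -1 and zeroValue = 0, a task may legitimately report 0.
   val initValue = -1L
   val zeroValue = 0L
   var value = initValue
   value = zeroValue            // a valid update that happens to be 0
   println(value == initValue)  // false: an initValue-based isZero is false
                                // even though the metric holds the zero value
   ```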




Re: [PR] [SPARK-46294][SQL] Clean up semantics of init vs zero value [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44222:
URL: https://github.com/apache/spark/pull/44222#discussion_r1426175626


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala:
##########
@@ -37,36 +37,47 @@ import org.apache.spark.util.AccumulatorContext.internOption
  * the executor side are automatically propagated and shown in the SQL UI through metrics. Updates
  * on the driver side must be explicitly posted using [[SQLMetrics.postDriverMetricUpdates()]].
  */
-class SQLMetric(val metricType: String, initValue: Long = 0L) extends AccumulatorV2[Long, Long] {
-  // This is a workaround for SPARK-11013.
-  // We may use -1 as initial value of the accumulator, if the accumulator is valid, we will
-  // update it at the end of task and the value will be at least 0. Then we can filter out the -1
-  // values before calculate max, min, etc.
-  private[this] var _value = initValue
-  private var _zeroValue = initValue
+class SQLMetric(val metricType: String,
+                initValue: Long = 0L,
+                zeroValue: Long = 0L) extends AccumulatorV2[Long, Long] {
+  // initValue defines the initial value of the metric. zeroValue defines the lowest value
+  // considered valid. If a SQLMetric is invalid, it is set to zeroValue upon receiving any
+  // updates, and it also reports zeroValue as its value to avoid exposing it to the user
+  // programmatically.
+  //
+  // For many SQLMetrics, we use initValue = -1 and zeroValue = 0 to indicate that the metric is
+  // by default invalid. At the end of a task, we will update the metric making it valid, and the
+  // invalid metrics will be filtered out when calculating min, max, etc. as a workaround for
+  // SPARK-11013.
+  private var _value = initValue
 
   override def copy(): SQLMetric = {
-    val newAcc = new SQLMetric(metricType, _value)
-    newAcc._zeroValue = initValue
+    val newAcc = new SQLMetric(metricType, initValue, zeroValue)
+    newAcc._value = _value
     newAcc
   }
 
-  override def reset(): Unit = _value = _zeroValue
+  override def reset(): Unit = _value = initValue
 
   override def merge(other: AccumulatorV2[Long, Long]): Unit = other match {
     case o: SQLMetric =>
-      if (o.value > 0) {
-        if (_value < 0) _value = 0
+      if (o.isValid) {
+        if (!isValid) _value = zeroValue
         _value += o.value
       }
     case _ => throw QueryExecutionErrors.cannotMergeClassWithOtherClassError(
       this.getClass.getName, other.getClass.getName)
   }
 
-  override def isZero: Boolean = _value == _zeroValue
+  // This is used to filter out metrics. Metrics with value equal to initValue should
+  // be filtered out, since they are either invalid or safe to filter without changing
+  // the aggregation defined in [[SQLMetrics.stringValue]].
+  override def isZero: Boolean = _value == initValue

Review Comment:
   Let's enrich the comment to highlight that we may want to collect the 0 value for calculating min/max/avg. We can still link to SPARK-11013.
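   
   One possible wording for the enriched comment (illustrative only, not the text that was eventually merged):
   
   ```scala
   // isZero tells the accumulator framework whether this metric carries any
   // information. It intentionally compares against initValue, not zeroValue:
   // a metric that was validly updated to zeroValue (e.g. a row count of 0)
   // must still be collected, because that 0 should participate in the
   // min/max/avg computed by SQLMetrics.stringValue. Metrics still at
   // initValue are either invalid or safe to drop without changing the
   // aggregation (workaround for SPARK-11013).
   override def isZero: Boolean = _value == initValue
   ```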




Re: [PR] [SPARK-46294][SQL] Clean up semantics of init vs zero value [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44222:
URL: https://github.com/apache/spark/pull/44222#discussion_r1426171810


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala:
##########
@@ -37,36 +37,47 @@ import org.apache.spark.util.AccumulatorContext.internOption
  * the executor side are automatically propagated and shown in the SQL UI through metrics. Updates
  * on the driver side must be explicitly posted using [[SQLMetrics.postDriverMetricUpdates()]].
  */
-class SQLMetric(val metricType: String, initValue: Long = 0L) extends AccumulatorV2[Long, Long] {
-  // This is a workaround for SPARK-11013.
-  // We may use -1 as initial value of the accumulator, if the accumulator is valid, we will
-  // update it at the end of task and the value will be at least 0. Then we can filter out the -1
-  // values before calculate max, min, etc.
-  private[this] var _value = initValue
-  private var _zeroValue = initValue
+class SQLMetric(val metricType: String,

Review Comment:
   ```suggestion
   class SQLMetric(
       val metricType: String,
   ```




Re: [PR] [SPARK-46294][SQL] Clean up semantics of init vs zero value [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #44222:
URL: https://github.com/apache/spark/pull/44222#issuecomment-1856973646

   thanks, merging to master!



Re: [PR] [SPARK-46294][SQL] Clean up semantics of init vs zero value [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44222:
URL: https://github.com/apache/spark/pull/44222#discussion_r1426171943


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala:
##########
@@ -37,36 +37,47 @@ import org.apache.spark.util.AccumulatorContext.internOption
  * the executor side are automatically propagated and shown in the SQL UI through metrics. Updates
  * on the driver side must be explicitly posted using [[SQLMetrics.postDriverMetricUpdates()]].
  */
-class SQLMetric(val metricType: String, initValue: Long = 0L) extends AccumulatorV2[Long, Long] {
-  // This is a workaround for SPARK-11013.
-  // We may use -1 as initial value of the accumulator, if the accumulator is valid, we will
-  // update it at the end of task and the value will be at least 0. Then we can filter out the -1
-  // values before calculate max, min, etc.
-  private[this] var _value = initValue
-  private var _zeroValue = initValue
+class SQLMetric(val metricType: String,

Review Comment:
   4 spaces indentation for multi-line parameters
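   
   Applied to the full signature, the suggested style would read as follows (a formatting sketch based on this suggestion, not the merged diff):
   
   ```scala
   class SQLMetric(
       val metricType: String,
       initValue: Long = 0L,
       zeroValue: Long = 0L) extends AccumulatorV2[Long, Long] {
   ```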




Re: [PR] [SPARK-46294][SQL] Clean up semantics of init vs zero value [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan closed pull request #44222: [SPARK-46294][SQL] Clean up semantics of init vs zero value
URL: https://github.com/apache/spark/pull/44222

