You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/18 23:18:56 UTC

[GitHub] [spark] ryan-johnson-databricks opened a new pull request, #37573: [SPARK-40141] Remove unnecessary TaskContext addTaskXxxListener overloads

ryan-johnson-databricks opened a new pull request, #37573:
URL: https://github.com/apache/spark/pull/37573

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   
   `TaskContext` currently defines two sets of functions for registering listeners:
   ```scala
   def addTaskCompletionListener(listener: TaskCompletionListener): TaskContext
   def addTaskCompletionListener[U](f: (TaskContext) => U): TaskContext
   
   def addTaskFailureListener(listener: TaskFailureListener): TaskContext
   def addTaskFailureListener(f: (TaskContext, Throwable) => Unit): TaskContext
   ```
   
   Before JDK8, the overloads were a convenient way to register a new listener without having to instantiate a new class. However, with the introduction of [functional interfaces](https://docs.oracle.com/javase/tutorial/java/javaOO/lambdaexpressions.html#approach5) in JDK8+, the two signatures are now equivalent, because a function whose signature matches the only method of a functional interface can be used in place of that interface. 
   
   Result: cryptic ambiguous overload errors when trying to use the function-only overload, which prompted a [scala bug report](https://github.com/scala/bug/issues/11016) (which was never addressed), as well as an attempted workaround that makes `addTaskCompletionListener` gratuitously generic, so that the compiler no longer considers e.g. `addTaskCompletionListener[Unit]` as equivalent to the overload that accepts a `TaskFailureListener`. The latter workaround was never applied to `addTaskFailureListener` for some reason). 
   
   Now that JDK8 is the minimum supported version, we can dispense with the overloads and rely entirely on functional interfaces to work as expected. The vast majority of call sites can now use the function form instead of the class form, which simplifies the code considerably.
   
   While we're at it, standardize the call sites. One-liners use the functional form:
   ```scala
   addTaskCompletionListener(_ => ...)
   ```
   while multi-liners use the block form:
   ```scala
   addTaskCompletionListener { _ =>
     ...
   }
   ```
   
   ### Why are the changes needed?
   
   JDK8+ functional interfaces conflict with the existing overloads. The task listener interface becomes simpler and easier to use if we align with the JDK.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Developers who rely on this developer API will need to remove the gratuitous `[Unit]` when registering functions as listeners.
   
   ### How was this patch tested?
   
   All use sites in the spark code base have been updated to use the new mechanism. The fact that they continue to compile is the strongest evidence that the change worked. The tests that exercise the changed code sites also verify correctness. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] closed pull request #37573: [SPARK-40141][CORE] Remove unnecessary TaskContext addTaskXxxListener overloads

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #37573: [SPARK-40141][CORE] Remove unnecessary TaskContext addTaskXxxListener overloads
URL: https://github.com/apache/spark/pull/37573


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] JoshRosen commented on a diff in pull request #37573: [SPARK-40141][CORE] Remove unnecessary TaskContext addTaskXxxListener overloads

Posted by GitBox <gi...@apache.org>.
JoshRosen commented on code in PR #37573:
URL: https://github.com/apache/spark/pull/37573#discussion_r950405113


##########
core/src/main/scala/org/apache/spark/TaskContext.scala:
##########
@@ -113,48 +115,21 @@ abstract class TaskContext extends Serializable {
    *
    * Exceptions thrown by the listener will result in failure of the task.
    */
+  @DeveloperApi

Review Comment:
   I confirmed that the TaskCompletionListener and TaskFailureListener classes themselves have been marked as `DeveloperApi` since ~2014/2015 👍 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] commented on pull request #37573: [SPARK-40141][CORE] Remove unnecessary TaskContext addTaskXxxListener overloads

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #37573:
URL: https://github.com/apache/spark/pull/37573#issuecomment-1332933221

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #37573: [SPARK-40141][CORE] Remove unnecessary TaskContext addTaskXxxListener overloads

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #37573:
URL: https://github.com/apache/spark/pull/37573#issuecomment-1220343453

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ryan-johnson-databricks commented on pull request #37573: [SPARK-40141][CORE] Remove unnecessary TaskContext addTaskXxxListener overloads

Posted by GitBox <gi...@apache.org>.
ryan-johnson-databricks commented on PR #37573:
URL: https://github.com/apache/spark/pull/37573#issuecomment-1220705114

   > Hi, @ryan-johnson-databricks . Apache Spark uses the PR contributor's GitHub Action resources instead of Apache Spark GitHub Action resources. Please enable GitHub Action in your repository. Currently, it seems to be disabled in your repo.
   > 
   > <img alt="Screen Shot 2022-08-18 at 5 30 49 PM" width="765" src="https://user-images.githubusercontent.com/9700541/185517574-aee078bb-7153-4b04-8958-3ca1c4f0fd41.png">
   
   I _did_ enable it -- and it even worked briefly -- but something seems to have gone wrong. I have an open ticket w/ github support but they've not yet responded.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on pull request #37573: [SPARK-40141][CORE] Remove unnecessary TaskContext addTaskXxxListener overloads

Posted by GitBox <gi...@apache.org>.
mridulm commented on PR #37573:
URL: https://github.com/apache/spark/pull/37573#issuecomment-1222819333

   Since this is a source level incompatibility (as per @JoshRosen's analysis), and given this has been around for a while as a `DeveloperApi`, I would mark it as deprecated - and relook at removing it in the next major release.
   We can ofcourse cleanup our own use 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #37573: [SPARK-40141][CORE] Remove unnecessary TaskContext addTaskXxxListener overloads

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #37573:
URL: https://github.com/apache/spark/pull/37573#issuecomment-1220725627

   This is developer API so I'm fine with this cleanup. Can you push an empty commit to retrigger the Github Action tests?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #37573: [SPARK-40141][CORE] Remove unnecessary TaskContext addTaskXxxListener overloads

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #37573:
URL: https://github.com/apache/spark/pull/37573#discussion_r949734277


##########
connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala:
##########
@@ -115,11 +115,7 @@ private[sql] class AvroFileFormat extends FileFormat
 
         // Ensure that the reader is closed even if the task fails or doesn't consume the entire
         // iterator of records.
-        Option(TaskContext.get()).foreach { taskContext =>
-          taskContext.addTaskCompletionListener[Unit] { _ =>
-            reader.close()
-          }
-        }
+        Option(TaskContext.get()).foreach(_.addTaskCompletionListener(_ => reader.close()))

Review Comment:
   are these cleanup possible without removing the overloads with scala lambda?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ryan-johnson-databricks commented on pull request #37573: [SPARK-40141][CORE] Remove unnecessary TaskContext addTaskXxxListener overloads

Posted by GitBox <gi...@apache.org>.
ryan-johnson-databricks commented on PR #37573:
URL: https://github.com/apache/spark/pull/37573#issuecomment-1222544502

   > For Spark's own internal development, I'm wondering whether this change will introduce source-compatibility concerns in our own patch backports: If I write a bugfix patch using the new lambda syntax then cherry-pick that patch to older branches then I'll run into compile failures. Of course, _the option_ to be compatible exists but a developer might forget to use it (esp. since IDEs are likely to suggest a replacement to the lambda syntax).
   
   I'm not sure that concern is very troublesome?
   1. The task listener code is _very_ slow-moving -- code involving a task listener tends to not change after being added, other than a half dozen refactors that move/indent it. Full analysis below.
   1. In the last four years, 11 new listeners were added by a half dozen code authors. Each case would have required the author to figure out that `[Unit]` was needed. Meanwhile, there has only been _ONE_ bugfix "backport" since `[Unit]` was added in 2018 -- and that one backport was for one of the 11 new-code changes, from master back to a recent branch cut, to fix a regression found during release testing. So today the new dev cost of keeping `[Unit]` is 10x higher than the backport cost of removing it.
   1. it's pretty easy to add `[Unit]` back in if needed?
   
   (***) By "_very_ slow moving" I mean:
   * There are no prod uses of `addTaskFailureListener` in the spark code base today -- which probably explains why nobody realized it needed to become polymorphic before now.
   * Out of 45 prod files that use `addTaskCompletionListener[Unit]`, 28 have no changes since the Aug 2018 refactor that added `[Unit]` as part of scala-2.12 effort. Of the 17 remaining files (full list below) only one change was a bug fix that would have needed a backport. That change added a new listener (2022-03-09) and merged to master less than a week before spark-3.3 branch cut (2022-03-15); a perf regression was during release testing and fixed by revert (2022-04-26).
   
   List of recent changes to `addTaskCompletionListener[Unit]`:
     * d3d22928d4f in Jun 2022, ArrowConverters.scala (adding a null-check to the task context a listener gets added to, as part of a refactor)
     * 20ffbf7b308 in Apr 2022, DataSourceRDD.scala (refactor-induced indentation change)
     * 6b5a1f9df28 in Apr 2022, ShuffledHashJoinExec.scala (revert a Mar 2022 change to code first added in Aug 2020)
     * 8714eefe6f9 in Aug 2021, Columnar.scala (refactor-induced indentation change)
     * 3257a30e539 in Jun 2021, RocksDB.scala (implement new RocksDB connector)
     * cc9a158712d in Jun 2021, SortMergeJoinExec.scala (added a new completion listener to update a spill size metric)
     * f11950f08f6 in Mar 2021, ExternalSorter.scala (refactor that moved existing code to a new location and also reformatted it)
     * d871b54a4e9 in Jan 2021, ObjectAggregationIterator.scala (added a new completion listener for capturing metrics)
     * 21413b7dd4e in Nov 2020, streaming/state/package.scala (new code)
     * 713124d5e32 in Aug 2020, InMemoryRelation.scala (refactor that moved an existing completion listener)
     * d7b268ab326 in Dec 2019, CollectMetricsExec.scala (added a new completion listener that collects metrics)
     * 05988b256e8 in Sep 2019, BaseArrowPythonRunner.scala (refactor that moved existing code to a new file, otherwise unchanged)
     * 3663dbe5418 in Jul 2019, AvroPartitionReaderFactory.scala (new code, avro DSv2 support)
     * 23ebd389b5c in Jun 2019, ParquetPartitionReaderFactory.scala (new code, parquet DSv2 support)
     * d50603a37c4 in Apr 2019, TextPartitionReaderFactory.scala (new code, text DSv2 support)
     * 8126d09fb5b in Feb 2019, ArrowRunner.scala (new code)
     * 1280bfd7564 in Jan 2019, PipedRDD.scala (bug fix that required adding a new completion listener)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #37573: [SPARK-40141][CORE] Remove unnecessary TaskContext addTaskXxxListener overloads

Posted by GitBox <gi...@apache.org>.
ryan-johnson-databricks commented on code in PR #37573:
URL: https://github.com/apache/spark/pull/37573#discussion_r950160112


##########
connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala:
##########
@@ -115,11 +115,7 @@ private[sql] class AvroFileFormat extends FileFormat
 
         // Ensure that the reader is closed even if the task fails or doesn't consume the entire
         // iterator of records.
-        Option(TaskContext.get()).foreach { taskContext =>
-          taskContext.addTaskCompletionListener[Unit] { _ =>
-            reader.close()
-          }
-        }
+        Option(TaskContext.get()).foreach(_.addTaskCompletionListener(_ => reader.close()))

Review Comment:
   Generally, yes -- tho they sometimes don't fit on one line until the `[Unit]` gets deleted.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #37573: [SPARK-40141][CORE] Remove unnecessary TaskContext addTaskXxxListener overloads

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on PR #37573:
URL: https://github.com/apache/spark/pull/37573#issuecomment-1220154115

   Another thing to check is that whether this changes enforce end users to rebuild their app jar or not. If we all consider that end users should rebuild their app jar based on the Spark version then that is OK, but we probably may not want to enforce this in every minor version.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #37573: [SPARK-40141][CORE] Remove unnecessary TaskContext addTaskXxxListener overloads

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #37573:
URL: https://github.com/apache/spark/pull/37573#issuecomment-1223351603

   Yep deprecation doesn't help. The caller _can_ use casts to disambiguate it, but that's ugly. I wouldn't object strongly to remove this before 4.0 as it's a developer API, but by the same token, it's a developer API. Is it worth the binary-compatibility breakage vs just having devs use casts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #37573: [SPARK-40141][CORE] Remove unnecessary TaskContext addTaskXxxListener overloads

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on PR #37573:
URL: https://github.com/apache/spark/pull/37573#issuecomment-1220148766

   This reminds me about so many pair methods for both Scala and Java friendly... Even there are several such methods in DataFrame. When we migrated to Scala 2.12, we figured out such issue, and our decision was let them leave to not break anything. But I agree it would be nice if we could reconsider removal of one of pair methods which can be taken care by Scala.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ryan-johnson-databricks commented on pull request #37573: [SPARK-40141][CORE] Remove unnecessary TaskContext addTaskXxxListener overloads

Posted by GitBox <gi...@apache.org>.
ryan-johnson-databricks commented on PR #37573:
URL: https://github.com/apache/spark/pull/37573#issuecomment-1223300698

   > Since this is a source level incompatibility (as per @JoshRosen's analysis), and given this has been around for a while as a `DeveloperApi`, I would mark it as deprecated - and relook at removing it in the next major release. We can ofcourse cleanup our own use
   
   We can't "just" mark it as deprecated and also clean up our own call sites, because this is an ambiguous overload. If we mark the polymorphic function overload as deprecated, that just forces everyone to either ignore the warning, or to create an actual listener object until we get around to removing the deprecated overload. 
   
   Neither of those seems to provide much benefit?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org