Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/11 12:37:55 UTC

[GitHub] [spark] weixiuli opened a new pull request #35492: [SPARK-38191][CORE] The staging directory of write job only needs to be initialized once in HadoopMapReduceCommitProtocol.

weixiuli opened a new pull request #35492:
URL: https://github.com/apache/spark/pull/35492


   
   ### What changes were proposed in this pull request?
   
   Replace the stagingDir method with a stagingDir constant in HadoopMapReduceCommitProtocol.
   
   ### Why are the changes needed?
   The staging directory of a write job only needs to be initialized once in HadoopMapReduceCommitProtocol; computing it once instead of on every call improves performance.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   ### How was this patch tested?
   Pass the CIs.
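
The difference between a method and a cached value can be sketched as follows. This is a minimal illustration, not Spark's code: `computeStagingDir`, `WithDef`, `WithLazyVal`, and the `.staging-` path layout are hypothetical names introduced here, and plain strings stand in for Hadoop paths.

```scala
object StagingDirEvalDemo {
  // Counts how often the (hypothetical) staging path is computed.
  private var computations = 0

  // Hypothetical stand-in for getStagingDir(path, jobId); not Spark's implementation.
  private def computeStagingDir(path: String, jobId: String): String = {
    computations += 1
    s"$path/.staging-$jobId"
  }

  // Variant 1: a method, re-evaluated on every access.
  class WithDef(path: String, jobId: String) {
    def stagingDir: String = computeStagingDir(path, jobId)
  }

  // Variant 2: a lazy val, evaluated on first access and then cached.
  class WithLazyVal(path: String, jobId: String) {
    lazy val stagingDir: String = computeStagingDir(path, jobId)
  }

  // Resets the counter, reads the staging dir `accesses` times,
  // and returns how many times the path was actually computed.
  def computationsFor(readStagingDir: () => String, accesses: Int): Int = {
    computations = 0
    (1 to accesses).foreach(_ => readStagingDir())
    computations
  }

  def main(args: Array[String]): Unit = {
    val viaDef = new WithDef("/out", "job1")
    val viaLazy = new WithLazyVal("/out", "job1")
    println(computationsFor(() => viaDef.stagingDir, 5))  // prints 5
    println(computationsFor(() => viaLazy.stagingDir, 5)) // prints 1
  }
}
```

With a `def`, every read repeats the computation; with a `lazy val`, only the first read does. How much this matters in practice depends on how often the value is read and how costly the computation is.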


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] weixiuli commented on pull request #35492: [SPARK-38191][CORE] The staging directory of write job only needs to be initialized once in HadoopMapReduceCommitProtocol.

Posted by GitBox <gi...@apache.org>.
weixiuli commented on pull request #35492:
URL: https://github.com/apache/spark/pull/35492#issuecomment-1036975951


   stagingDir is called many times in commitJob, especially while traversing partitionPaths when dynamicPartitionOverwrite is true.
   https://github.com/apache/spark/blob/25a4c5fa84d64e37cf5c27c7b2f0f29867330bf2/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L218-L236




[GitHub] [spark] srowen commented on pull request #35492: [SPARK-38191][CORE] The staging directory of write job only needs to be initialized once in HadoopMapReduceCommitProtocol.

Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #35492:
URL: https://github.com/apache/spark/pull/35492#issuecomment-1054987111


   Reverted, sorry @weixiuli. Maybe you were right that this needed `@transient` after all! Maybe try it again and check against Hadoop 2.




[GitHub] [spark] weixiuli removed a comment on pull request #35492: [SPARK-38191][CORE] The staging directory of write job only needs to be initialized once in HadoopMapReduceCommitProtocol.

Posted by GitBox <gi...@apache.org>.
weixiuli removed a comment on pull request #35492:
URL: https://github.com/apache/spark/pull/35492#issuecomment-1036975951


   stagingDir is called many times in commitJob, especially while traversing partitionPaths when dynamicPartitionOverwrite is true.
   https://github.com/apache/spark/blob/25a4c5fa84d64e37cf5c27c7b2f0f29867330bf2/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L218-L236




[GitHub] [spark] weixiuli commented on pull request #35492: [SPARK-38191][CORE] The staging directory of write job only needs to be initialized once in HadoopMapReduceCommitProtocol.

Posted by GitBox <gi...@apache.org>.
weixiuli commented on pull request #35492:
URL: https://github.com/apache/spark/pull/35492#issuecomment-1053934193


   Thanks for your review @srowen




[GitHub] [spark] codecov-commenter edited a comment on pull request #35492: [SPARK-38191][CORE] The staging directory of write job only needs to be initialized once in HadoopMapReduceCommitProtocol.

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #35492:
URL: https://github.com/apache/spark/pull/35492#issuecomment-1037139355


   # [Codecov](https://codecov.io/gh/apache/spark/pull/35492?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#35492](https://codecov.io/gh/apache/spark/pull/35492?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (25a4c5f) into [master](https://codecov.io/gh/apache/spark/commit/d4a2e5c55d127218f6ae42925443f7d0588d5875?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (d4a2e5c) will **decrease** coverage by `7.69%`.
   > The diff coverage is `n/a`.
   
   > :exclamation: Current head 25a4c5f differs from pull request most recent head 807a639. Consider uploading reports for the commit 807a639 to get more accurate results
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/spark/pull/35492/graphs/tree.svg?width=650&height=150&src=pr&token=R9pHLWgWi8&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/spark/pull/35492?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master   #35492      +/-   ##
   ==========================================
   - Coverage   91.27%   83.58%   -7.70%     
   ==========================================
     Files         297      257      -40     
     Lines       64021    58306    -5715     
     Branches     9903     9293     -610     
   ==========================================
   - Hits        58437    48733    -9704     
   - Misses       4232     8379    +4147     
   + Partials     1352     1194     -158     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | unittests | `83.56% <ø> (-7.70%)` | :arrow_down: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/spark/pull/35492?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [python/pyspark/join.py](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cHl0aG9uL3B5c3Bhcmsvam9pbi5weQ==) | `12.12% <0.00%> (-81.82%)` | :arrow_down: |
   | [python/pyspark/ml/tuning.py](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cHl0aG9uL3B5c3BhcmsvbWwvdHVuaW5nLnB5) | `25.03% <0.00%> (-67.46%)` | :arrow_down: |
   | [python/pyspark/ml/pipeline.py](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cHl0aG9uL3B5c3BhcmsvbWwvcGlwZWxpbmUucHk=) | `35.53% <0.00%> (-59.40%)` | :arrow_down: |
   | [python/pyspark/util.py](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cHl0aG9uL3B5c3BhcmsvdXRpbC5weQ==) | `30.76% <0.00%> (-54.71%)` | :arrow_down: |
   | [python/pyspark/shuffle.py](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cHl0aG9uL3B5c3Bhcmsvc2h1ZmZsZS5weQ==) | `18.73% <0.00%> (-53.68%)` | :arrow_down: |
   | [python/pyspark/rdd.py](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cHl0aG9uL3B5c3BhcmsvcmRkLnB5) | `40.42% <0.00%> (-52.09%)` | :arrow_down: |
   | [python/pyspark/ml/image.py](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cHl0aG9uL3B5c3BhcmsvbWwvaW1hZ2UucHk=) | `33.33% <0.00%> (-50.01%)` | :arrow_down: |
   | [python/pyspark/ml/wrapper.py](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cHl0aG9uL3B5c3BhcmsvbWwvd3JhcHBlci5weQ==) | `45.85% <0.00%> (-47.32%)` | :arrow_down: |
   | [python/pyspark/streaming/dstream.py](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cHl0aG9uL3B5c3Bhcmsvc3RyZWFtaW5nL2RzdHJlYW0ucHk=) | `37.89% <0.00%> (-46.49%)` | :arrow_down: |
   | [python/pyspark/ml/util.py](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cHl0aG9uL3B5c3BhcmsvbWwvdXRpbC5weQ==) | `42.22% <0.00%> (-45.40%)` | :arrow_down: |
   | ... and [79 more](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/spark/pull/35492?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/spark/pull/35492?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [d4a2e5c...807a639](https://codecov.io/gh/apache/spark/pull/35492?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   




[GitHub] [spark] weixiuli commented on pull request #35492: [SPARK-38191][CORE] The staging directory of write job only needs to be initialized once in HadoopMapReduceCommitProtocol.

Posted by GitBox <gi...@apache.org>.
weixiuli commented on pull request #35492:
URL: https://github.com/apache/spark/pull/35492#issuecomment-1047509699


   /cc @srowen Can you help me review this PR? Thanks.




[GitHub] [spark] AmplabJenkins commented on pull request #35492: [SPARK-38191][CORE] The staging directory of write job only needs to be initialized once in HadoopMapReduceCommitProtocol.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #35492:
URL: https://github.com/apache/spark/pull/35492#issuecomment-1037793392


   Can one of the admins verify this patch?




[GitHub] [spark] Ngone51 commented on pull request #35492: [SPARK-38191][CORE] The staging directory of write job only needs to be initialized once in HadoopMapReduceCommitProtocol.

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on pull request #35492:
URL: https://github.com/apache/spark/pull/35492#issuecomment-1054910730


   cc @weixiuli @srowen This change causes a test failure with Hadoop 2:
   
   ```
   ./build/sbt -mem 4096 -Phadoop-2 "sql/testOnly org.apache.spark.sql.sources.PartitionedWriteSuite -- -z SPARK-27194"
   ```
   
   ```
   ...
   [info]   Cause: org.apache.spark.SparkException: Task not serializable
   ...
   [info]   Cause: java.io.NotSerializableException: org.apache.hadoop.fs.Path
   ...
   ```
   
   It looks like this is because `org.apache.hadoop.fs.Path` is serializable in Hadoop 3 but not in Hadoop 2.
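
The failure mode described above can be reproduced without Hadoop. The following is a minimal sketch under stated assumptions: `FakePath` is a hypothetical stand-in for a path type that does not implement `java.io.Serializable`, and `LazyHolder`, `TransientLazyHolder`, and `roundTrip` are illustrative names, not Spark classes. It shows that a cached field holding such a value breaks Java serialization once it has been initialized, while a `@transient lazy val` is skipped during serialization and recomputed on first access afterwards.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.io.{ObjectInputStream, ObjectOutputStream}

object TransientLazyValDemo {
  // Hypothetical stand-in for a path type that is NOT java.io.Serializable.
  final class FakePath(val value: String)

  // Plain lazy val: once forced, the cached FakePath is part of the object state.
  class LazyHolder(path: String, jobId: String) extends Serializable {
    lazy val stagingDir: FakePath = new FakePath(s"$path/.staging-$jobId")
  }

  // @transient lazy val: the cached value is not serialized and is
  // recomputed from path and jobId on first access after deserialization.
  class TransientLazyHolder(path: String, jobId: String) extends Serializable {
    @transient lazy val stagingDir: FakePath = new FakePath(s"$path/.staging-$jobId")
  }

  // Serializes and deserializes an object with plain Java serialization.
  def roundTrip[T <: AnyRef](obj: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(obj) finally out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val plain = new LazyHolder("/out", "job1")
    plain.stagingDir // force initialization before serializing
    // Fails with java.io.NotSerializableException for FakePath.
    println(scala.util.Try(roundTrip(plain)).isFailure)

    val transientLazy = new TransientLazyHolder("/out", "job1")
    transientLazy.stagingDir // forcing it does not affect serialization
    println(roundTrip(transientLazy).stagingDir.value)
  }
}
```

The `@transient` variant relies on `path` and `jobId` being serializable themselves, since they are what the value is rebuilt from on the receiving side.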




[GitHub] [spark] srowen closed pull request #35492: [SPARK-38191][CORE] The staging directory of write job only needs to be initialized once in HadoopMapReduceCommitProtocol.

Posted by GitBox <gi...@apache.org>.
srowen closed pull request #35492:
URL: https://github.com/apache/spark/pull/35492


   




[GitHub] [spark] weixiuli commented on a change in pull request #35492: [SPARK-38191][CORE] The staging directory of write job only needs to be initialized once in HadoopMapReduceCommitProtocol.

Posted by GitBox <gi...@apache.org>.
weixiuli commented on a change in pull request #35492:
URL: https://github.com/apache/spark/pull/35492#discussion_r812567016



##########
File path: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
##########
@@ -104,7 +104,7 @@ class HadoopMapReduceCommitProtocol(
    * The staging directory of this write job. Spark uses it to deal with files with absolute output
    * path, or writing data into partitioned directory with dynamicPartitionOverwrite=true.
    */
-  protected def stagingDir = getStagingDir(path, jobId)
+  @transient protected lazy val stagingDir = getStagingDir(path, jobId)

Review comment:
       I have already checked that OutputCommitCoordinatorSuite fails when stagingDir is not lazy.
   
   For example:
   
   ```scala
   test("If commit fails, if task is retried it should not be locked, and will succeed.") {
     val rdd = sc.parallelize(Seq(1), 1)
     sc.runJob(rdd, OutputCommitFunctions(tempDir.getAbsolutePath).failFirstCommitAttempt _,
       0 until rdd.partitions.size)
     assert(tempDir.list().size === 1)
   }
   ```
   
   ```
   Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (10.0.78.226 executor driver): java.lang.reflect.InvocationTargetException
   	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
   	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
   	at org.apache.spark.internal.io.FileCommitProtocol$.instantiate(FileCommitProtocol.scala:228)
   	at org.apache.spark.scheduler.OutputCommitFunctions.runCommitWithProvidedCommitter(OutputCommitCoordinatorSuite.scala:316)
   	at org.apache.spark.scheduler.OutputCommitFunctions.failFirstCommitAttempt(OutputCommitCoordinatorSuite.scala:304)
   	at org.apache.spark.scheduler.OutputCommitCoordinatorSuite.$anonfun$new$8(OutputCommitCoordinatorSuite.scala:148)
   	at org.apache.spark.scheduler.OutputCommitCoordinatorSuite.$anonfun$new$8$adapted(OutputCommitCoordinatorSuite.scala:148)
   	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2268)
   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
   	at org.apache.spark.scheduler.Task.run(Task.scala:136)
   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507)
   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1475)
   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   	at java.lang.Thread.run(Thread.java:748)
   Caused by: java.lang.IllegalArgumentException: Can not create a Path from a null string
   	at org.apache.hadoop.fs.Path.checkPathArg(Path.java:168)
   	at org.apache.hadoop.fs.Path.<init>(Path.java:184)
   	at org.apache.hadoop.fs.Path.<init>(Path.java:119)
   	at org.apache.spark.internal.io.FileCommitProtocol$.getStagingDir(FileCommitProtocol.scala:233)
   	at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.<init>(HadoopMapReduceCommitProtocol.scala:107)
   	at org.apache.spark.internal.io.HadoopMapRedCommitProtocol.<init>(HadoopMapRedCommitProtocol.scala:30)
   	... 18 more
   ```
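
The stack trace above shows the path being built inside the constructor (`<init>`) with a null argument. A minimal sketch of why an eager `val` fails at construction time while a `lazy val` defers the failure until first use follows. `EagerProtocol`, `LazyProtocol`, and `buildStagingDir` are hypothetical names for illustration only, and strings stand in for Hadoop paths.

```scala
object EagerVsLazyInitDemo {
  // Hypothetical stand-in for getStagingDir: rejects a null path,
  // throwing IllegalArgumentException via require.
  def buildStagingDir(path: String, jobId: String): String = {
    require(path != null, "cannot create a path from a null string")
    s"$path/.staging-$jobId"
  }

  // Eager val: computed while the constructor runs.
  class EagerProtocol(jobId: String, path: String) {
    protected val stagingDir: String = buildStagingDir(path, jobId)
  }

  // Lazy val: computed only when something actually reads it.
  class LazyProtocol(jobId: String, path: String) {
    protected lazy val stagingDir: String = buildStagingDir(path, jobId)
    def resolveStagingDir(): String = stagingDir
  }

  def main(args: Array[String]): Unit = {
    // Construction itself throws when path is null.
    println(scala.util.Try(new EagerProtocol("job1", null)).isFailure)

    // Construction succeeds; only reading the value would throw.
    val deferred = new LazyProtocol("job1", null)
    println(scala.util.Try(deferred.resolveStagingDir()).isFailure)

    // With a real path, the lazy variant resolves normally.
    println(new LazyProtocol("job1", "/out").resolveStagingDir())
  }
}
```

Code paths that construct the object with a null path but never touch the staging directory therefore keep working with the lazy variant.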
   






[GitHub] [spark] srowen commented on a change in pull request #35492: [SPARK-38191][CORE] The staging directory of write job only needs to be initialized once in HadoopMapReduceCommitProtocol.

Posted by GitBox <gi...@apache.org>.
srowen commented on a change in pull request #35492:
URL: https://github.com/apache/spark/pull/35492#discussion_r812015679



##########
File path: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
##########
@@ -104,7 +104,7 @@ class HadoopMapReduceCommitProtocol(
    * The staging directory of this write job. Spark uses it to deal with files with absolute output
    * path, or writing data into partitioned directory with dynamicPartitionOverwrite=true.
    */
-  protected def stagingDir = getStagingDir(path, jobId)
+  @transient protected lazy val stagingDir = getStagingDir(path, jobId)

Review comment:
       Why not just `val`? `lazy` may not be worth it. Is it `@transient` because the staging dir may vary across driver and worker?






[GitHub] [spark] Ngone51 commented on pull request #35492: [SPARK-38191][CORE] The staging directory of write job only needs to be initialized once in HadoopMapReduceCommitProtocol.

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on pull request #35492:
URL: https://github.com/apache/spark/pull/35492#issuecomment-1054973454


   +1 to revert




[GitHub] [spark] codecov-commenter edited a comment on pull request #35492: [SPARK-38191][CORE] The staging directory of write job only needs to be initialized once in HadoopMapReduceCommitProtocol.

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #35492:
URL: https://github.com/apache/spark/pull/35492#issuecomment-1037139355


   # [Codecov](https://codecov.io/gh/apache/spark/pull/35492?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#35492](https://codecov.io/gh/apache/spark/pull/35492?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (25a4c5f) into [master](https://codecov.io/gh/apache/spark/commit/d4a2e5c55d127218f6ae42925443f7d0588d5875?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (d4a2e5c) will **decrease** coverage by `15.90%`.
   > The diff coverage is `n/a`.
   
   > :exclamation: Current head 25a4c5f differs from pull request most recent head 807a639. Consider uploading reports for the commit 807a639 to get more accurate results
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/spark/pull/35492/graphs/tree.svg?width=650&height=150&src=pr&token=R9pHLWgWi8&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/spark/pull/35492?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@             Coverage Diff             @@
   ##           master   #35492       +/-   ##
   ===========================================
   - Coverage   91.27%   75.37%   -15.91%     
   ===========================================
     Files         297      211       -86     
     Lines       64021    49479    -14542     
     Branches     9903     8306     -1597     
   ===========================================
   - Hits        58437    37293    -21144     
   - Misses       4232    11185     +6953     
   + Partials     1352     1001      -351     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | unittests | `75.35% <ø> (-15.90%)` | :arrow_down: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/spark/pull/35492?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [python/pyspark/join.py](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cHl0aG9uL3B5c3Bhcmsvam9pbi5weQ==) | `12.12% <0.00%> (-81.82%)` | :arrow_down: |
   | [python/pyspark/sql/observation.py](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cHl0aG9uL3B5c3Bhcmsvc3FsL29ic2VydmF0aW9uLnB5) | `26.08% <0.00%> (-69.57%)` | :arrow_down: |
   | [python/pyspark/ml/tuning.py](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cHl0aG9uL3B5c3BhcmsvbWwvdHVuaW5nLnB5) | `25.03% <0.00%> (-67.46%)` | :arrow_down: |
   | [python/pyspark/rdd.py](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cHl0aG9uL3B5c3BhcmsvcmRkLnB5) | `30.48% <0.00%> (-62.04%)` | :arrow_down: |
   | [python/pyspark/streaming/dstream.py](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cHl0aG9uL3B5c3Bhcmsvc3RyZWFtaW5nL2RzdHJlYW0ucHk=) | `22.65% <0.00%> (-61.72%)` | :arrow_down: |
   | [python/pyspark/util.py](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cHl0aG9uL3B5c3BhcmsvdXRpbC5weQ==) | `24.78% <0.00%> (-60.69%)` | :arrow_down: |
   | [python/pyspark/streaming/util.py](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cHl0aG9uL3B5c3Bhcmsvc3RyZWFtaW5nL3V0aWwucHk=) | `29.62% <0.00%> (-60.50%)` | :arrow_down: |
   | [python/pyspark/resource/requests.py](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cHl0aG9uL3B5c3BhcmsvcmVzb3VyY2UvcmVxdWVzdHMucHk=) | `35.25% <0.00%> (-60.44%)` | :arrow_down: |
   | [python/pyspark/cloudpickle/compat.py](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cHl0aG9uL3B5c3BhcmsvY2xvdWRwaWNrbGUvY29tcGF0LnB5) | `30.00% <0.00%> (-60.00%)` | :arrow_down: |
   | [python/pyspark/ml/pipeline.py](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cHl0aG9uL3B5c3BhcmsvbWwvcGlwZWxpbmUucHk=) | `35.53% <0.00%> (-59.40%)` | :arrow_down: |
   | ... and [158 more](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/spark/pull/35492?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/spark/pull/35492?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [d4a2e5c...807a639](https://codecov.io/gh/apache/spark/pull/35492?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   




[GitHub] [spark] codecov-commenter edited a comment on pull request #35492: [SPARK-38191][CORE] The staging directory of write job only needs to be initialized once in HadoopMapReduceCommitProtocol.

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #35492:
URL: https://github.com/apache/spark/pull/35492#issuecomment-1037139355


   # [Codecov](https://codecov.io/gh/apache/spark/pull/35492?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#35492](https://codecov.io/gh/apache/spark/pull/35492?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (25a4c5f) into [master](https://codecov.io/gh/apache/spark/commit/d4a2e5c55d127218f6ae42925443f7d0588d5875?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (d4a2e5c) will **decrease** coverage by `0.00%`.
   > The diff coverage is `n/a`.
   
   > :exclamation: Current head 25a4c5f differs from pull request most recent head 807a639. Consider uploading reports for the commit 807a639 to get more accurate results
   
   
   ```diff
   @@            Coverage Diff             @@
   ##           master   #35492      +/-   ##
   ==========================================
   - Coverage   91.27%   91.27%   -0.01%     
   ==========================================
     Files         297      297              
     Lines       64021    64021              
     Branches     9903     9903              
   ==========================================
   - Hits        58437    58433       -4     
   - Misses       4232     4233       +1     
   - Partials     1352     1355       +3     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | unittests | `91.24% <ø> (-0.01%)` | :arrow_down: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/spark/pull/35492?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [python/pyspark/streaming/tests/test\_context.py](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cHl0aG9uL3B5c3Bhcmsvc3RyZWFtaW5nL3Rlc3RzL3Rlc3RfY29udGV4dC5weQ==) | `97.63% <0.00%> (-1.58%)` | :arrow_down: |
   | [...n/pyspark/mllib/tests/test\_streaming\_algorithms.py](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cHl0aG9uL3B5c3BhcmsvbWxsaWIvdGVzdHMvdGVzdF9zdHJlYW1pbmdfYWxnb3JpdGhtcy5weQ==) | `76.34% <0.00%> (-0.36%)` | :arrow_down: |
   | [python/pyspark/streaming/tests/test\_dstream.py](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cHl0aG9uL3B5c3Bhcmsvc3RyZWFtaW5nL3Rlc3RzL3Rlc3RfZHN0cmVhbS5weQ==) | `95.24% <0.00%> (-0.24%)` | :arrow_down: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/spark/pull/35492?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/spark/pull/35492?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [d4a2e5c...807a639](https://codecov.io/gh/apache/spark/pull/35492?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   




[GitHub] [spark] weixiuli commented on a change in pull request #35492: [SPARK-38191][CORE] The staging directory of write job only needs to be initialized once in HadoopMapReduceCommitProtocol.

Posted by GitBox <gi...@apache.org>.
weixiuli commented on a change in pull request #35492:
URL: https://github.com/apache/spark/pull/35492#discussion_r805112929



##########
File path: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
##########
@@ -104,7 +104,7 @@ class HadoopMapReduceCommitProtocol(
    * The staging directory of this write job. Spark uses it to deal with files with absolute output
    * path, or writing data into partitioned directory with dynamicPartitionOverwrite=true.
    */
-  protected def stagingDir = getStagingDir(path, jobId)
+  @transient protected lazy val stagingDir = getStagingDir(path, jobId)

Review comment:
       The `stagingDir` method is called many times in `commitJob`, especially while traversing `partitionPaths` when `dynamicPartitionOverwrite` is true. So we should use a `stagingDir` constant instead of the `stagingDir` method to avoid repeated function calls.
   https://github.com/apache/spark/blob/25a4c5fa84d64e37cf5c27c7b2f0f29867330bf2/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L218-L236
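
The difference between the `def` and the `lazy val` in the diff above can be sketched in isolation. This is only an illustration, not Spark's code: `StagingDirSketch`, `buildStagingDir`, `WithDef`, `WithLazyVal`, and `countCalls` are hypothetical names, and the path format is assumed for the example.

```scala
// Hypothetical stand-in for the commit protocol; none of these names come from Spark.
object StagingDirSketch {
  // Counts how often the staging path is computed.
  var calls = 0

  // Stand-in for getStagingDir(path, jobId); the path format is an assumption.
  def buildStagingDir(path: String, jobId: String): String = {
    calls += 1
    s"$path/.spark-staging-$jobId"
  }

  // A `def` re-runs the computation on every access.
  class WithDef(path: String, jobId: String) {
    def stagingDir: String = buildStagingDir(path, jobId)
  }

  // A `lazy val` runs the computation on first access and caches the result.
  class WithLazyVal(path: String, jobId: String) {
    lazy val stagingDir: String = buildStagingDir(path, jobId)
  }

  // Resets the counter, reads the value `times` times, and reports the computations seen.
  def countCalls(read: () => String, times: Int): Int = {
    calls = 0
    (1 to times).foreach(_ => read())
    calls
  }
}
```

Reading `stagingDir` in a loop, as `commitJob` does over `partitionPaths`, computes the path once per access with `WithDef` and only once in total with `WithLazyVal`.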






[GitHub] [spark] codecov-commenter commented on pull request #35492: [SPARK-38191][CORE] The staging directory of write job only needs to be initialized once in HadoopMapReduceCommitProtocol.

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #35492:
URL: https://github.com/apache/spark/pull/35492#issuecomment-1037139355


   # [Codecov](https://codecov.io/gh/apache/spark/pull/35492?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#35492](https://codecov.io/gh/apache/spark/pull/35492?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (25a4c5f) into [master](https://codecov.io/gh/apache/spark/commit/d4a2e5c55d127218f6ae42925443f7d0588d5875?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (d4a2e5c) will **decrease** coverage by `29.80%`.
   > The diff coverage is `n/a`.
   
   > :exclamation: Current head 25a4c5f differs from pull request most recent head 807a639. Consider uploading reports for the commit 807a639 to get more accurate results
   
   
   ```diff
   @@             Coverage Diff             @@
   ##           master   #35492       +/-   ##
   ===========================================
   - Coverage   91.27%   61.47%   -29.81%     
   ===========================================
     Files         297      202       -95     
     Lines       64021    39860    -24161     
     Branches     9903     7517     -2386     
   ===========================================
   - Hits        58437    24502    -33935     
   - Misses       4232    14151     +9919     
   + Partials     1352     1207      -145     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | unittests | `61.47% <ø> (-29.79%)` | :arrow_down: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/spark/pull/35492?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [python/pyspark/join.py](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cHl0aG9uL3B5c3Bhcmsvam9pbi5weQ==) | `12.12% <0.00%> (-81.82%)` | :arrow_down: |
   | [python/pyspark/sql/observation.py](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cHl0aG9uL3B5c3Bhcmsvc3FsL29ic2VydmF0aW9uLnB5) | `26.08% <0.00%> (-69.57%)` | :arrow_down: |
   | [python/pyspark/ml/tuning.py](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cHl0aG9uL3B5c3BhcmsvbWwvdHVuaW5nLnB5) | `25.03% <0.00%> (-67.46%)` | :arrow_down: |
   | [python/pyspark/rdd.py](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cHl0aG9uL3B5c3BhcmsvcmRkLnB5) | `28.12% <0.00%> (-64.39%)` | :arrow_down: |
   | [python/pyspark/pandas/frame.py](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cHl0aG9uL3B5c3BhcmsvcGFuZGFzL2ZyYW1lLnB5) | `33.03% <0.00%> (-64.03%)` | :arrow_down: |
   | [python/pyspark/streaming/dstream.py](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cHl0aG9uL3B5c3Bhcmsvc3RyZWFtaW5nL2RzdHJlYW0ucHk=) | `22.65% <0.00%> (-61.72%)` | :arrow_down: |
   | [python/pyspark/util.py](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cHl0aG9uL3B5c3BhcmsvdXRpbC5weQ==) | `24.78% <0.00%> (-60.69%)` | :arrow_down: |
   | [python/pyspark/streaming/util.py](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cHl0aG9uL3B5c3Bhcmsvc3RyZWFtaW5nL3V0aWwucHk=) | `29.62% <0.00%> (-60.50%)` | :arrow_down: |
   | [python/pyspark/resource/requests.py](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cHl0aG9uL3B5c3BhcmsvcmVzb3VyY2UvcmVxdWVzdHMucHk=) | `35.25% <0.00%> (-60.44%)` | :arrow_down: |
   | [python/pyspark/cloudpickle/compat.py](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cHl0aG9uL3B5c3BhcmsvY2xvdWRwaWNrbGUvY29tcGF0LnB5) | `30.00% <0.00%> (-60.00%)` | :arrow_down: |
   | ... and [191 more](https://codecov.io/gh/apache/spark/pull/35492/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/spark/pull/35492?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/spark/pull/35492?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [d4a2e5c...807a639](https://codecov.io/gh/apache/spark/pull/35492?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   




[GitHub] [spark] weixiuli commented on a change in pull request #35492: [SPARK-38191][CORE] The staging directory of write job only needs to be initialized once in HadoopMapReduceCommitProtocol.

Posted by GitBox <gi...@apache.org>.
weixiuli commented on a change in pull request #35492:
URL: https://github.com/apache/spark/pull/35492#discussion_r812540471



##########
File path: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
##########
@@ -104,7 +104,7 @@ class HadoopMapReduceCommitProtocol(
    * The staging directory of this write job. Spark uses it to deal with files with absolute output
    * path, or writing data into partitioned directory with dynamicPartitionOverwrite=true.
    */
-  protected def stagingDir = getStagingDir(path, jobId)
+  @transient protected lazy val stagingDir = getStagingDir(path, jobId)

Review comment:
       If it is not `lazy`, it may fail; I will try changing it. The `@transient` may be unnecessary.






[GitHub] [spark] weixiuli commented on pull request #35492: [SPARK-38191][CORE] The staging directory of write job only needs to be initialized once in HadoopMapReduceCommitProtocol.

Posted by GitBox <gi...@apache.org>.
weixiuli commented on pull request #35492:
URL: https://github.com/apache/spark/pull/35492#issuecomment-1055017519


   > maybe you were right that this needed transient after all! maybe try it again and check against Hadoop 2.
   
   OK, I will check against Hadoop 2 with `@transient`.




[GitHub] [spark] srowen commented on pull request #35492: [SPARK-38191][CORE] The staging directory of write job only needs to be initialized once in HadoopMapReduceCommitProtocol.

Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #35492:
URL: https://github.com/apache/spark/pull/35492#issuecomment-1054942124


   Ah, darn. OK I think we have to revert it unfortunately




[GitHub] [spark] AmplabJenkins commented on pull request #35492: [SPARK-38191][CORE] The staging directory of write job only needs to be initialized once in HadoopMapReduceCommitProtocol.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #35492:
URL: https://github.com/apache/spark/pull/35492#issuecomment-1037793392


   Can one of the admins verify this patch?




[GitHub] [spark] weixiuli commented on pull request #35492: [SPARK-38191][CORE] The staging directory of write job only needs to be initialized once in HadoopMapReduceCommitProtocol.

Posted by GitBox <gi...@apache.org>.
weixiuli commented on pull request #35492:
URL: https://github.com/apache/spark/pull/35492#issuecomment-1049423989


   
   
   
   
   > OK, I suppose lazy forces it to be recomputed on the executor?
   
   I think so, and this optimization makes sense for executors.
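
The recomputation being discussed can be sketched with plain Java serialization. This is a simplified model, not Spark's code: `TransientLazySketch`, `Protocol`, and `roundTrip` are hypothetical names, the path format is assumed, and the round trip happens inside one JVM rather than between a driver and an executor.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical model of a serializable commit protocol; names are illustrative only.
object TransientLazySketch {
  // Counts evaluations of the lazy val within this JVM.
  var evaluations = 0

  class Protocol(path: String, jobId: String) extends Serializable {
    // @transient keeps the cached value out of the serialized form,
    // so a deserialized copy evaluates the body again on first access.
    @transient lazy val stagingDir: String = {
      evaluations += 1
      s"$path/.spark-staging-$jobId"
    }
  }

  // Serializes and deserializes a value, standing in for shipping it to an executor.
  def roundTrip[T <: AnyRef](value: T): T = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

In this sketch, accessing `stagingDir` on the original instance and then on the `roundTrip` copy evaluates the body once per instance, and both produce the same path because `path` and `jobId` travel with the object.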




[GitHub] [spark] srowen commented on pull request #35492: [SPARK-38191][CORE] The staging directory of write job only needs to be initialized once in HadoopMapReduceCommitProtocol.

Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #35492:
URL: https://github.com/apache/spark/pull/35492#issuecomment-1050407075


   Merged to master




[GitHub] [spark] weixiuli commented on a change in pull request #35492: [SPARK-38191][CORE] The staging directory of write job only needs to be initialized once in HadoopMapReduceCommitProtocol.

Posted by GitBox <gi...@apache.org>.
weixiuli commented on a change in pull request #35492:
URL: https://github.com/apache/spark/pull/35492#discussion_r812659914



##########
File path: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
##########
@@ -104,7 +104,7 @@ class HadoopMapReduceCommitProtocol(
    * The staging directory of this write job. Spark uses it to deal with files with absolute output
    * path, or writing data into partitioned directory with dynamicPartitionOverwrite=true.
    */
-  protected def stagingDir = getStagingDir(path, jobId)
+  @transient protected lazy val stagingDir = getStagingDir(path, jobId)

Review comment:
       @srowen 



