Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/10/23 14:13:13 UTC

[GitHub] [spark] steveloughran opened a new pull request #30141: SPARK-33230. FileOutputWriter to set jobConf "spark.sql.sources.write.jobUUID" to description.uuid

steveloughran opened a new pull request #30141:
URL: https://github.com/apache/spark/pull/30141


   ### What changes were proposed in this pull request?
   
   This reinstates the old option `spark.sql.sources.write.jobUUID`, setting a unique job UUID in the job configuration so that Hadoop MR committers have an ID which is (a) consistent across tasks and workers and (b) not as brittle as generated-timestamp job IDs. Timestamp IDs match the format `JobID` requires, but because they are generated per-thread they may not always be unique within a cluster.
   
   ### Why are the changes needed?
   
   If a committer (e.g. the S3A staging committer) uses the job attempt ID as a unique ID, then any two jobs started within the same second get the same ID and can clash.
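   The clash can be sketched in a few lines. This is a hypothetical illustration, not the real Spark or Hadoop code: the `timestampJobId` helper and its format string only mimic second-precision job-ID generation, while two random UUIDs stay distinct.

   ```java
   import java.text.SimpleDateFormat;
   import java.util.Date;
   import java.util.UUID;

   public class JobIdClash {
       // Hypothetical helper mimicking a second-precision, timestamp-based job ID.
       static String timestampJobId(long startMillis) {
           return "job_" + new SimpleDateFormat("yyyyMMddHHmmss").format(new Date(startMillis)) + "_0000";
       }

       public static void main(String[] args) {
           String a = timestampJobId(1603463593000L); // two job starts, 400 ms apart,
           String b = timestampJobId(1603463593400L); // inside the same wall-clock second
           System.out.println(a.equals(b));           // the generated IDs clash
           // Random UUIDs, by contrast, are unique for all practical purposes.
           System.out.println(UUID.randomUUID().equals(UUID.randomUUID()));
       }
   }
   ```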
   
   ### Does this PR introduce _any_ user-facing change?
   
   Good question. It is "developer-facing" in the context of anyone writing a committer, but it reinstates a property which existed in Spark 1.x and later went away.
   
   ### How was this patch tested?
   
   Testing: no test here. You'd have to create a new committer which extracted the value in both job and task(s) and verified consistency. That is possible (with a task output whose records contained the UUID), but it would be pretty convoluted and carry a high maintenance cost.
          
   Because it's trying to address a race condition, it's hard to reproduce the problem downstream and so verify a fix in a test run... I'll just look at the logs to see what temporary dir is being used in the cluster FS and verify it's a UUID.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30141: [SPARK-33230][SQL] Hadoop committers to get unique job ID in "spark.sql.sources.writeJobUUID"

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30141:
URL: https://github.com/apache/spark/pull/30141#issuecomment-715370919


   **[Test build #130204 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130204/testReport)** for PR 30141 at commit [`cfbb49d`](https://github.com/apache/spark/commit/cfbb49d7155f50bcbe82e436d7db8447aa53d920).




[GitHub] [spark] AmplabJenkins commented on pull request #30141: SPARK-33230. Hadoop committers to get unique job ID in "spark.sql.sources.writeJobUUID"

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30141:
URL: https://github.com/apache/spark/pull/30141#issuecomment-715412702








[GitHub] [spark] SparkQA commented on pull request #30141: [SPARK-33230][SQL] Hadoop committers to get unique job ID in "spark.sql.sources.writeJobUUID"

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30141:
URL: https://github.com/apache/spark/pull/30141#issuecomment-715513372


   **[Test build #130204 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130204/testReport)** for PR 30141 at commit [`cfbb49d`](https://github.com/apache/spark/commit/cfbb49d7155f50bcbe82e436d7db8447aa53d920).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30141: [SPARK-33230][SQL] Hadoop committers to get unique job ID in "spark.sql.sources.writeJobUUID"

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #30141:
URL: https://github.com/apache/spark/pull/30141#discussion_r510985437



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala
##########
@@ -164,6 +164,10 @@ object FileFormatWriter extends Logging {
 
     SQLExecution.checkSQLExecutionId(sparkSession)
 
+    // propagate the description UUID into the jobs, so that committers
+    // get an ID guaranteed to be unique.
+    job.getConfiguration.set("spark.sql.sources.writeJobUUID", description.uuid)

Review comment:
       Do you mean the Apache Hadoop S3A committer uses this? If not, `spark.sql.sources.writeJobUUID` is not used inside the Spark 3.x/2.x code base.
   
   In Spark 1.6, it was part of the file name.
   ```scala
   val uniqueWriteJobId = conf.get("spark.sql.sources.writeJobUUID")
   ...
   val filename = f"part-r-$partition%05d-$uniqueWriteJobId.orc"
   ```






[GitHub] [spark] dongjoon-hyun commented on pull request #30141: SPARK-33230. Hadoop committers to get unique job ID in "spark.sql.sources.writeJobUUID"

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #30141:
URL: https://github.com/apache/spark/pull/30141#issuecomment-715418702


   Thank you, @steveloughran !




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30141: [SPARK-33230][SQL] Hadoop committers to get unique job ID in "spark.sql.sources.writeJobUUID"

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30141:
URL: https://github.com/apache/spark/pull/30141#issuecomment-715515522








[GitHub] [spark] dongjoon-hyun commented on pull request #30141: [SPARK-33230][SQL] Hadoop committers to get unique job ID in "spark.sql.sources.writeJobUUID"

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #30141:
URL: https://github.com/apache/spark/pull/30141#issuecomment-716775382


   cc @cloud-fan 




[GitHub] [spark] steveloughran commented on pull request #30141: [SPARK-33230][SQL] Hadoop committers to get unique job ID in "spark.sql.sources.writeJobUUID"

Posted by GitBox <gi...@apache.org>.
steveloughran commented on pull request #30141:
URL: https://github.com/apache/spark/pull/30141#issuecomment-716544258


   @rdblue I am going to add two things to the committers:
   
   1. Auto-generate in job setup. More specifically: generate in the constructor, and in task setup fail if that path was generated locally and the same committer hasn't been used for job setup/abort (testing, primarily).
   2. Add an option to fail fast if the Spark option hasn't been set. That will let me do some regression testing on the Spark code by setting that in downstream tests.
   
   
   
   
   




[GitHub] [spark] steveloughran commented on pull request #30141: [SPARK-33230][SQL] Hadoop committers to get unique job ID in "spark.sql.sources.writeJobUUID"

Posted by GitBox <gi...@apache.org>.
steveloughran commented on pull request #30141:
URL: https://github.com/apache/spark/pull/30141#issuecomment-717324176


   Thanks!






[GitHub] [spark] rdblue commented on pull request #30141: [SPARK-33230][SQL] Hadoop committers to get unique job ID in "spark.sql.sources.writeJobUUID"

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #30141:
URL: https://github.com/apache/spark/pull/30141#issuecomment-715449287


   I think what we ended up doing was to generate a UUID in `OutputCommitter.setupJob` and place it in the Hadoop `Configuration`. Now that the configuration is reliable for a given stage, wouldn't that work here instead of using the one from Spark?
   
   I'm okay adding this. It matches what we do in v2 writes. The new API also passes a UUID in, so I think there is a reasonable precedent for it, even if the committer or writer could generate one itself.
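   The setupJob approach can be sketched as below. This is a hedged illustration, not the actual S3A or Spark code: a plain `Map` stands in for the Hadoop `Configuration` to keep the sketch self-contained, and the helper name is invented. The committer prefers the Spark-propagated `spark.sql.sources.writeJobUUID` and only self-generates a UUID as a fallback.

   ```java
   import java.util.Map;
   import java.util.UUID;

   public class JobUuidResolver {
       static final String WRITE_JOB_UUID = "spark.sql.sources.writeJobUUID";

       // Prefer the UUID Spark propagated into the job configuration;
       // fall back to generating one locally (e.g. in setupJob).
       static String resolveJobUuid(Map<String, String> conf) {
           String fromSpark = conf.get(WRITE_JOB_UUID);
           return fromSpark != null ? fromSpark : UUID.randomUUID().toString();
       }

       public static void main(String[] args) {
           System.out.println(resolveJobUuid(Map.of(WRITE_JOB_UUID, "u-1"))); // uses Spark's UUID
           System.out.println(resolveJobUuid(Map.of()));                      // self-generated fallback
       }
   }
   ```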




[GitHub] [spark] steveloughran commented on a change in pull request #30141: [SPARK-33230][SQL] Hadoop committers to get unique job ID in "spark.sql.sources.writeJobUUID"

Posted by GitBox <gi...@apache.org>.
steveloughran commented on a change in pull request #30141:
URL: https://github.com/apache/spark/pull/30141#discussion_r511955017



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala
##########
@@ -164,6 +164,10 @@ object FileFormatWriter extends Logging {
 
     SQLExecution.checkSQLExecutionId(sparkSession)
 
+    // propagate the description UUID into the jobs, so that committers
+    // get an ID guaranteed to be unique.
+    job.getConfiguration.set("spark.sql.sources.writeJobUUID", description.uuid)

Review comment:
       It picked it up if set, so yes, it was being used. We've hit a problem where, if more than one job kicks off in the same second for the same user, the generated app ID is the same for both, so the staging committers end up using the same directory in HDFS. The committers already use the writeJobUUID property if set; restoring the original config option means the shipping artifacts will work.






[GitHub] [spark] steveloughran commented on pull request #30141: [SPARK-33230][SQL] Hadoop committers to get unique job ID in "spark.sql.sources.writeJobUUID"

Posted by GitBox <gi...@apache.org>.
steveloughran commented on pull request #30141:
URL: https://github.com/apache/spark/pull/30141#issuecomment-716545472


   @dongjoon-hyun 
   > Do you think you can add a test case, @steveloughran ?
   
   not easily. It would need a new Hadoop committer (a subclass of FileOutputCommitter is easiest) which fails if the option isn't set on Spark queries, *and somewhere to put it*. If you've got suggestions as to where I could put it and a test I could work off, I'll do my best. I like ScalaTest.




[GitHub] [spark] steveloughran edited a comment on pull request #30141: SPARK-33230. Hadoop committers to get unique job ID in "spark.sql.sources.writeJobUUID"

Posted by GitBox <gi...@apache.org>.
steveloughran edited a comment on pull request #30141:
URL: https://github.com/apache/spark/pull/30141#issuecomment-715380385


   Moving the generation down into the HadoopMapReduceCommitProtocol so that wherever a job is set up (SQL, RDD) it gets a consistent UUID.
   
   I'm going to modify the S3A staging committer to have an option which requires the UUID to be set. This can be used to verify that the property is propagating correctly. Consistent setting across jobs and tasks will be inferred simply from whether jobs complete with the expected set of files.
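   A fail-fast check of that shape might look like the sketch below. This is hypothetical: the `fs.s3a.committer.require.uuid` option name and the `checkJobUuid` helper are invented for illustration, and a `Map` stands in for the Hadoop `Configuration`. When the option is enabled and Spark did not propagate a job UUID, setup fails instead of silently self-generating and risking a shared staging directory.

   ```java
   import java.util.Map;
   import java.util.UUID;

   public class RequireUuidCheck {
       // Hypothetical option name: demand a Spark-propagated UUID, else fail.
       static final String REQUIRE_UUID = "fs.s3a.committer.require.uuid";

       static String checkJobUuid(Map<String, String> conf) {
           String uuid = conf.get("spark.sql.sources.writeJobUUID");
           boolean require = Boolean.parseBoolean(conf.getOrDefault(REQUIRE_UUID, "false"));
           if (uuid != null) {
               return uuid;                      // Spark propagated a unique job UUID
           }
           if (require) {
               // fail fast: lets downstream tests catch a broken propagation path
               throw new IllegalStateException("spark.sql.sources.writeJobUUID is not set");
           }
           return UUID.randomUUID().toString();  // fallback: self-generate in job setup
       }
   }
   ```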




[GitHub] [spark] SparkQA commented on pull request #30141: SPARK-33230. Hadoop committers to get unique job ID in "spark.sql.sources.writeJobUUID"

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30141:
URL: https://github.com/apache/spark/pull/30141#issuecomment-715412681


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34805/
   




[GitHub] [spark] SparkQA commented on pull request #30141: SPARK-33230. FileOutputWriter to set jobConf "spark.sql.sources.write.jobUUID" to description.uuid

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30141:
URL: https://github.com/apache/spark/pull/30141#issuecomment-715370919


   **[Test build #130204 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130204/testReport)** for PR 30141 at commit [`cfbb49d`](https://github.com/apache/spark/commit/cfbb49d7155f50bcbe82e436d7db8447aa53d920).




[GitHub] [spark] dongjoon-hyun commented on pull request #30141: [SPARK-33230][SQL] Hadoop committers to get unique job ID in "spark.sql.sources.writeJobUUID"

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #30141:
URL: https://github.com/apache/spark/pull/30141#issuecomment-716656387


   cc @sunchao 




[GitHub] [spark] steveloughran commented on pull request #30141: SPARK-33230. FileOutputWriter to set jobConf "spark.sql.sources.write.jobUUID" to description.uuid

Posted by GitBox <gi...@apache.org>.
steveloughran commented on pull request #30141:
URL: https://github.com/apache/spark/pull/30141#issuecomment-715368928


   @dongjoon-hyun FYI
   @rdblue - I think this is regressing some of your old code.




[GitHub] [spark] SparkQA commented on pull request #30141: SPARK-33230. FileOutputWriter to set jobConf "spark.sql.sources.write.jobUUID" to description.uuid

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30141:
URL: https://github.com/apache/spark/pull/30141#issuecomment-715399610


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34805/
   




[GitHub] [spark] AmplabJenkins commented on pull request #30141: [SPARK-33230][SQL] Hadoop committers to get unique job ID in "spark.sql.sources.writeJobUUID"

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30141:
URL: https://github.com/apache/spark/pull/30141#issuecomment-715515522








[GitHub] [spark] AmplabJenkins removed a comment on pull request #30141: SPARK-33230. Hadoop committers to get unique job ID in "spark.sql.sources.writeJobUUID"

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30141:
URL: https://github.com/apache/spark/pull/30141#issuecomment-715412702








[GitHub] [spark] dongjoon-hyun closed pull request #30141: [SPARK-33230][SQL] Hadoop committers to get unique job ID in "spark.sql.sources.writeJobUUID"

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #30141:
URL: https://github.com/apache/spark/pull/30141


   






[GitHub] [spark] steveloughran commented on pull request #30141: SPARK-33230. FileOutputWriter to set jobConf "spark.sql.sources.write.jobUUID" to description.uuid

Posted by GitBox <gi...@apache.org>.
steveloughran commented on pull request #30141:
URL: https://github.com/apache/spark/pull/30141#issuecomment-715380385


   Looking at this some more, the other place to set it would be org.apache.spark.internal.io.cloud.PathOutputCommitProtocol, where it's set from the job ID there (which is also a UUID, just a different one, generated earlier).
   
   It would only be visible to committers created that way, but given that the classic file output formats don't use job ID values in any way, that should not be an issue.





