You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "steveloughran (via GitHub)" <gi...@apache.org> on 2023/02/28 16:13:36 UTC

[GitHub] [spark] steveloughran opened a new pull request, #40221: [SPARK-41551][SQL] Dynamic/absolute path support in PathOutputCommitters

steveloughran opened a new pull request, #40221:
URL: https://github.com/apache/spark/pull/40221

   
   ### What changes were proposed in this pull request?
   
   Follow on to SPARK-40034, Dynamic/absolute path support in PathOutputCommitters
   
   Dynamic partitioning though the PathOutputCommitProtocol needs to add the target directories to the superclass's partition list else the partition delete doesn't
   take place -the job extends the dataset, rather than replaces it.
   
   Fix:
   * add an `addPartition()` method subclasses can use
   * add a `getPartitions` method to return an immutable
     copy of the list for testing.
   * add the test.
   * Also fix `newTaskTempFileAbsPath(`) to return a path, irrespective of committer type.
   
   In dynamic mode, because the parent dir of an absolute path is deleted, there's a safety check to reject any requests for a file in a parent dir. This
   is something which could be pulled up to HadoopMapReduceCommitProtocol -it needs the same check, if the risk is considered realistic.
   
   The patch now downgrades from failing on dynamic partitioning if the committer doesn't declare it supports it to printing a warning. Why this? well, it
   *does* work through the s3a committers, it's just `O(data)`. If someone does want to do `INSERT OVERWRITE` then they can be allowed to, just warned about
   it. The outcome will be correct except in the case of: "if the driver fails partway through dir rename, only some of the files will be there"
   
   Finally, it update the protocol spec in `HadoopMapReduceCommitProtocol` to cover the dynamic partition job commit in more detail.
   
   ### Why are the changes needed?
   
   
   * The code in SPARK-40034 was incomplete. The new tests show that it works now.
   * dynamic partition jobs aren't actually incorrect through the s3 committers, just slow, so if someone really wants it they should be free to.
   * The `newFileAbsPath()` code is required of all committers, despite its near-total lack of use.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   Updates the cloud docs to say that dynamic partition overwrite does work everywhere, just may be really slow.
   
   
   ### How was this patch tested?
   
   New unit tests in `CommitterBindingSuite`; `
   
   New test suite `PathOutputPartitionedWriteSuite extends PartitionedWriteSuite` which runs the `PartitionedWriteSuite` through
   the PathOutputCommitter and with, on hadoop 3.3.5+ the manifest committer chosen.
   
   executed with
   * hadoop 3.3.4
   * RC2 of hadoop 3.3.5 through `-Dhadoop.version=3.3.5 -Psnapshots-and-staging
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] closed pull request #40221: [SPARK-41551][SQL] Dynamic/absolute path support in PathOutputCommitters

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #40221: [SPARK-41551][SQL] Dynamic/absolute path support in PathOutputCommitters
URL: https://github.com/apache/spark/pull/40221


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] commented on pull request #40221: [SPARK-41551][SQL] Dynamic/absolute path support in PathOutputCommitters

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #40221:
URL: https://github.com/apache/spark/pull/40221#issuecomment-1583660746

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org