Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/03/14 07:57:51 UTC

[GitHub] [spark] AngersZhuuuu commented on pull request #33828: [SPARK-36579][CORE][SQL] Make spark source stagingDir can be customized

AngersZhuuuu commented on pull request #33828:
URL: https://github.com/apache/spark/pull/33828#issuecomment-1066479723


   > * propose using `spark.sql.sources.writeJobUUID` as the job id when set; more uniqueness and it should be set everywhere.
   
   Right now every place uses Spark's job ID. I can do this in a follow-up after this PR, since it's a separate change.
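
   As a hedged sketch of why a per-job UUID gives more uniqueness than a timestamp-derived job id (the helper names here are illustrative, not Spark's actual API):

```python
import uuid
from datetime import datetime

# Spark's Hadoop job id is commonly derived from a timestamp, so two jobs
# launched in the same second by different applications can collide.
def timestamp_job_id(now: datetime) -> str:
    return "job_" + now.strftime("%Y%m%d%H%M%S") + "_0000"

# A per-job UUID (what a property like spark.sql.sources.writeJobUUID would
# carry) is unique regardless of when or where the job starts.
def uuid_job_id() -> str:
    return "job_" + uuid.uuid4().hex

t = datetime(2022, 3, 14, 7, 57, 51)
assert timestamp_job_id(t) == timestamp_job_id(t)  # same second -> collision
assert uuid_job_id() != uuid_job_id()              # UUIDs stay distinct
```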
   
   > * core design looks ok. but i don't see why you couldn't support concurrent jobs just by having different subdirs of `_temporary` for different job IDs/UUIDs, and an option to disable cleanup. (and instructions to do it later, which you'd need to do anyway).
   
   If two jobs write to different partitions of the same table, they have the same output path `${table_location}/_temporary/0`.
   When one job succeeds, it deletes that path, and the other job's data is lost.
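
   To illustrate the collision (a toy sketch; `staging_dir` is a hypothetical helper mirroring FileOutputCommitter's path layout):

```python
def staging_dir(table_location: str, app_attempt: int = 0) -> str:
    # FileOutputCommitter stages all task output under
    # <output path>/_temporary/<application attempt id>, which defaults to 0.
    return f"{table_location}/_temporary/{app_attempt}"

# Two concurrent jobs writing different partitions of the same table:
job_a = staging_dir("/warehouse/db/tbl")  # e.g. partition dt=2022-03-13
job_b = staging_dir("/warehouse/db/tbl")  # e.g. partition dt=2022-03-14

# They share one staging dir, so whichever job commits first deletes the
# other job's in-flight task output along with its own temp files.
assert job_a == job_b == "/warehouse/db/tbl/_temporary/0"
```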
   
   > * because that use of `_temporary/0` in the file output committer exists only so that, on a restart of the MR AM, the committer can use `_temporary/1` (using the app attempt number for the subdir) and then move the committed task data from job attempt 0 into its own dir, recovering all existing work. spark doesn't need that.
   
   That is because Spark still uses FileOutputCommitter, so we keep this behavior; if we rewrote the commit protocol, we could avoid it.
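
   A sketch of the attempt-id layout being discussed (illustrative only):

```python
def job_attempt_dir(output_path: str, app_attempt: int) -> str:
    # FileOutputCommitter keys the staging subdir on the app attempt number,
    # so a restarted MR ApplicationMaster (attempt 1) can recover task output
    # already committed under the previous attempt's directory (attempt 0).
    return f"{output_path}/_temporary/{app_attempt}"

assert job_attempt_dir("/out", 0) == "/out/_temporary/0"  # first AM attempt
assert job_attempt_dir("/out", 1) == "/out/_temporary/1"  # after AM restart
```

   Since Spark never restarts an attempt this way, the attempt number is always 0, which is exactly why every Spark job using this committer stages under the same `_temporary/0` directory.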
   
   > * it'd be good for you to try out my manifest committer against hdfs with your workloads. it is designed to be a lot faster in job commit because all listing of task output directory trees is done in task commit, and job commit does everything in parallel (listing of manifests, loading of manifests, creating dest dirs, file rename). some of the options you don't need for hdfs (parallel delete of task attempt temp dirs), but I still expect a massive speedup of job commit, though not as much as for stores where listing and rename is slower.
   
   Yeah, I will try it later. It's a very useful design and can reduce the pressure on HDFS a lot. I need to check it with our HDFS team too.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


