Posted to reviews@spark.apache.org by "AngersZhuuuu (via GitHub)" <gi...@apache.org> on 2023/11/22 02:24:52 UTC

Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

AngersZhuuuu commented on code in PR #43936:
URL: https://github.com/apache/spark/pull/43936#discussion_r1401427353


##########
core/src/main/scala/org/apache/spark/SparkContext.scala:
##########
@@ -1822,7 +1822,7 @@ class SparkContext(config: SparkConf) extends Logging {
       logInfo(s"Added file $path at $key with timestamp $timestamp")
       // Fetch the file locally so that closures which are run on the driver can still use the
       // SparkFiles API to access files.
-      Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, timestamp, useCache = false)
+      Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, timestamp, useCache = true)

Review Comment:
   Executor log when `updateDependencies`
   ```
   23/11/21 17:44:55 INFO Utils: Fetching hdfs://path/feature_map.txt to /mnt/ssd/2/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/spark-e5d383fd-0064-44e8-850b-c2c1934a0ddf/fetchFileTemp5380393885914736245.tmp
   23/11/21 17:44:55 INFO Utils: Copying /mnt/ssd/2/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/spark-e5d383fd-0064-44e8-850b-c2c1934a0ddf/-17061381181700559593903_cache to /mnt/ssd/1/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/container_e59_1698132018785_8173703_01_000683/./feature_map.txt
   ```
   
   On the executor side, `useCache = true` is passed when not in local mode, so the executor fetches the file into the cache and then copies the cached file into the root dir under its filename.
   
   For the `SparkContext` driver, the current code passes `useCache = false` and only fetches the file as a temp file:
   ```
   23/11/21 17:39:53 INFO [pool-3-thread-2] SparkContext: Added file hdfs://path/feature_map.txt at hdfs://path/feature_map.txt with timestamp 1700559593903
   23/11/21 17:39:54 INFO [pool-3-thread-2] Utils: Fetching hdfs://path/feature_map.txt to /mnt/ssd/0/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/spark-21bedef6-1c5e-464e-9cb0-bb6903b3d84c/userFiles-a4929fdb-b634-4829-a7e3-00d82b0d521b/fetchFileTemp8739978227963911629.tmp
   ```
   
   So the added file won't exist under the root dir with its filename.
   The code of `Utils.fetchFile()` is as below:
   <img width="1110" alt="Screenshot 2023-11-22 10:21:58 AM" src="https://github.com/apache/spark/assets/46485123/68f6e2f9-a6e2-493d-bd65-d7b2cc88fadd">
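   
   For reference, the caching branch of `Utils.fetchFile` roughly works like this (a simplified sketch, not the exact Spark source; arguments and helpers are elided):
   
   ```scala
   // Illustrative sketch of Utils.fetchFile's two code paths.
   def fetchFile(url: String, targetDir: File, /* ... */ timestamp: Long,
       useCache: Boolean): File = {
     val fileName = new URI(url).getPath.split("/").last
     val targetFile = new File(targetDir, fileName)
     if (useCache && conf.getBoolean("spark.files.useFetchCache", true)) {
       // Fetch once into a shared per-application cache file
       // (the "-17061381181700559593903_cache" file in the log above) ...
       val cachedFile = new File(localDir, s"${url.hashCode}${timestamp}_cache")
       if (!cachedFile.exists()) {
         doFetchFile(url, localDir, cachedFile.getName /* ... */)
       }
       // ... then copy it into targetDir under the original filename.
       Files.copy(cachedFile.toPath, targetFile.toPath)
     } else {
       // useCache = false: fetch directly into targetDir; per the driver log
       // above, the driver ends up with only a fetchFileTemp*.tmp here.
       doFetchFile(url, targetDir, fileName /* ... */)
     }
     targetFile
   }
   ```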
   
   
   It's clear that a local-mode executor should pass `useCache = false`, since in local mode it should use the file already fetched by the SparkContext.
   But with the current code, the SparkContext doesn't place the file under its filename.
   
   So I think it should work like this:
   
   1. `SparkContext.addFile` should also copy the file to the root dir under its filename, so the driver side can resolve the file by name and run local tasks on the driver.
   2. Non-local-mode executors will still update their dependencies and work as before.
   3. A local-mode executor is started in the driver process, so it can use the file already downloaded by `SparkContext.addFile()`.
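   
   This is the driver-side pattern the change is meant to support (an illustrative sketch assuming a running `SparkContext`; the path is from the logs above):
   
   ```scala
   // Register the file; with this PR it also lands under the driver's
   // root dir as feature_map.txt, not only as fetchFileTemp*.tmp.
   sc.addFile("hdfs://path/feature_map.txt")
   
   // SparkFiles.get resolves <root dir>/feature_map.txt, so a closure
   // running a local task on the driver can read it by name:
   val path = org.apache.spark.SparkFiles.get("feature_map.txt")
   val lines = scala.io.Source.fromFile(path).getLines().toSeq
   ```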
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

