You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "AngersZhuuuu (via GitHub)" <gi...@apache.org> on 2023/11/21 12:32:35 UTC

[PR] [SPARK-46034][BUG] SparkContext add file should also copy file to local root path [spark]

AngersZhuuuu opened a new pull request, #43936:
URL: https://github.com/apache/spark/pull/43936

   ### What changes were proposed in this pull request?
   For below case
   ```
   add jar hdfs://path/search_hadoop_udf-1.0.0-SNAPSHOT.jar;
   add file hdfs://path/feature_map.txt;
   
   CREATE or replace TEMPORARY FUNCTION si_to_fn AS "com.shopee.deep.data_mart.udf.SlotIdToFeatName";
   
   select si_to_fn(k, './feature_map.txt') as feat_name
   from (    select 'slot_8116' as k   ) A; 
   ```
   User use valued-table then the task will running on driver, but driver didn't copy this file to the root path, then failed.
   For `ADD FILE` driver also need to copy one file to root directory.
   
   
   
   ### Why are the changes needed?
   Fix bug
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   MT
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1822478004

   ```
   Fetching hdfs://R2/projects/search_algo/hdfs/dev/typhoon.bo/uploader/ego_config/feature_map.txt to /mnt/ssd/0/yarn/nm-local-dir/usercache/yi.zhu/appcache/application_1698132018785_8495989/spark-32bf42ec-8de1-4f0e-aa60-def6d0f185c6/userFiles-61713d37-87ab-42d6-bdcd-065fa65ac8fc/fetchFileTemp8380676662376886133.tmp
   Moving /mnt/ssd/0/yarn/nm-local-dir/usercache/yi.zhu/appcache/application_1698132018785_8495989/spark-32bf42ec-8de1-4f0e-aa60-def6d0f185c6/userFiles-61713d37-87ab-42d6-bdcd-065fa65ac8fc/fetchFileTemp8380676662376886133.tmp to /mnt/ssd/0/yarn/nm-local-dir/usercache/yi.zhu/appcache/application_1698132018785_8495989/spark-32bf42ec-8de1-4f0e-aa60-def6d0f185c6/userFiles-61713d37-87ab-42d6-bdcd-065fa65ac8fc/feature_map.txt
   Fetching hdfs://R2/projects/search_algo/hdfs/dev/typhoon.bo/uploader/ego_config/feature_map.txt to /mnt/ssd/0/yarn/nm-local-dir/usercache/yi.zhu/appcache/application_1698132018785_8495989/container_e2006_1698132018785_8495989_01_000001/./fetchFileTemp7654378954291310228.tmp
   Moving /mnt/ssd/0/yarn/nm-local-dir/usercache/yi.zhu/appcache/application_1698132018785_8495989/container_e2006_1698132018785_8495989_01_000001/./fetchFileTemp7654378954291310228.tmp to /mnt/ssd/0/yarn/nm-local-dir/usercache/yi.zhu/appcache/application_1698132018785_8495989/container_e2006_1698132018785_8495989_01_000001/./feature_map.txt
   ```
   
   Current code work well in when testing in our env.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

Posted by "junyi1313 (via GitHub)" <gi...@apache.org>.
junyi1313 commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1829023594

   > The root cause is spark driver download file to it's `driverTempPath`, but didn't download to container's execution root path. So in yarn cluster mode, if we need to use the file in driver side, it throw FileNotFoundException. cc @HyukjinKwon @cloud-fan I have updated the code.
   
   I encounter the same issue too. I think this should be fixed asap.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #43936:
URL: https://github.com/apache/spark/pull/43936#discussion_r1401443129


##########
core/src/main/scala/org/apache/spark/SparkContext.scala:
##########
@@ -1836,7 +1836,7 @@ class SparkContext(config: SparkConf) extends Logging {
       val uriToUse = if (!isLocal && scheme == "file") uri else new URI(key)
       val uriToDownload = UriBuilder.fromUri(uriToUse).fragment(null).build()
       val source = Utils.fetchFile(uriToDownload.toString, Utils.createTempDir(), conf,
-        hadoopConfiguration, timestamp, useCache = false, shouldUntar = false)
+        hadoopConfiguration, timestamp, useCache = true, shouldUntar = false)

Review Comment:
   We don't need this archive file under root directory - the archive file itself is removed. we only need the unpacked directory



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on code in PR #43936:
URL: https://github.com/apache/spark/pull/43936#discussion_r1401448409


##########
core/src/main/scala/org/apache/spark/SparkContext.scala:
##########
@@ -1836,7 +1836,7 @@ class SparkContext(config: SparkConf) extends Logging {
       val uriToUse = if (!isLocal && scheme == "file") uri else new URI(key)
       val uriToDownload = UriBuilder.fromUri(uriToUse).fragment(null).build()
       val source = Utils.fetchFile(uriToDownload.toString, Utils.createTempDir(), conf,
-        hadoopConfiguration, timestamp, useCache = false, shouldUntar = false)
+        hadoopConfiguration, timestamp, useCache = true, shouldUntar = false)

Review Comment:
   Seems already unpacked, remove this change



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on code in PR #43936:
URL: https://github.com/apache/spark/pull/43936#discussion_r1401447858


##########
core/src/main/scala/org/apache/spark/SparkContext.scala:
##########
@@ -1836,7 +1836,7 @@ class SparkContext(config: SparkConf) extends Logging {
       val uriToUse = if (!isLocal && scheme == "file") uri else new URI(key)
       val uriToDownload = UriBuilder.fromUri(uriToUse).fragment(null).build()
       val source = Utils.fetchFile(uriToDownload.toString, Utils.createTempDir(), conf,
-        hadoopConfiguration, timestamp, useCache = false, shouldUntar = false)
+        hadoopConfiguration, timestamp, useCache = true, shouldUntar = false)

Review Comment:
   Let me check about this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1827474881

   > Could you please update the PR description? It looks overdue and inaccurate for me to catch up with.
   > 
   > Could you also provide the output for `LIST FILE` both before and after this? I guess we shall use its output at the caller-side, let's verify that they match.
   
   The result of `LIST FILES`, won't have impact since we update `addedFiles` with origin path
   ```
   spark-sql> add jar hdfs://path/search_hadoop_udf-1.0.0-SNAPSHOT.jar;
   Time taken: 3.055 seconds
   spark-sql> add file hdfs://path/feature_map.txt;
   spark-sql> list jars;
   hdfs://path/search_hadoop_udf-1.0.0-SNAPSHOT.jar
   Time taken: 0.443 seconds, Fetched 1 row(s)
   spark-sql> list files;
   hdfs://path/feature_map.txt
   Time taken: 0.428 seconds, Fetched 1 row(s)
   spark-sql>
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #43936: [SPARK-46034][CORE] SparkContext add file should also copy file to local root path
URL: https://github.com/apache/spark/pull/43936


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on code in PR #43936:
URL: https://github.com/apache/spark/pull/43936#discussion_r1401447419


##########
core/src/main/scala/org/apache/spark/SparkContext.scala:
##########
@@ -1822,7 +1822,7 @@ class SparkContext(config: SparkConf) extends Logging {
       logInfo(s"Added file $path at $key with timestamp $timestamp")
       // Fetch the file locally so that closures which are run on the driver can still use the
       // SparkFiles API to access files.
-      Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, timestamp, useCache = false)
+      Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, timestamp, useCache = true)

Review Comment:
   > So are you saying that it doesn't fetch the file to `root` directory?
   
   didn't copy as same filename



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1827039250

   gentle ping @yaooqinn @cloud-fan @HyukjinKwon @tgravescs Could you take a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1827485987

   I don't know why we add a `driverTmpDir`, remove `driverTmpDir` also can resolve this issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on code in PR #43936:
URL: https://github.com/apache/spark/pull/43936#discussion_r1401427353


##########
core/src/main/scala/org/apache/spark/SparkContext.scala:
##########
@@ -1822,7 +1822,7 @@ class SparkContext(config: SparkConf) extends Logging {
       logInfo(s"Added file $path at $key with timestamp $timestamp")
       // Fetch the file locally so that closures which are run on the driver can still use the
       // SparkFiles API to access files.
-      Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, timestamp, useCache = false)
+      Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, timestamp, useCache = true)

Review Comment:
   Executor log when `updateDependencies`
   ```
   23/11/21 17:44:55 INFO Utils: Fetching hdfs://path/feature_map.txt to /mnt/ssd/2/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/spark-e5d383fd-0064-44e8-850b-c2c1934a0ddf/fetchFileTemp5380393885914736245.tmp
   23/11/21 17:44:55 INFO Utils: Copying /mnt/ssd/2/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/spark-e5d383fd-0064-44e8-850b-c2c1934a0ddf/-17061381181700559593903_cache to /mnt/ssd/1/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/container_e59_1698132018785_8173703_01_000683/./feature_map.txt
   ```
   
   In executor side, pass `useCache = true` when is not local mode, then executor will fetch the file to cache then copy cache file to root dir with filename.
   
   For sparkcontext dirver, current code pass `useCache=false` only fetch file as  file temp
   ```
   23/11/21 17:39:53 INFO [pool-3-thread-2] SparkContext: Added file hdfs://path/feature_map.txt at hdfs://path/feature_map.txt with timestamp 1700559593903
   23/11/21 17:39:54 INFO [pool-3-thread-2] Utils: Fetching hdfs://path/feature_map.txt to /mnt/ssd/0/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/spark-21bedef6-1c5e-464e-9cb0-bb6903b3d84c/userFiles-a4929fdb-b634-4829-a7e3-00d82b0d521b/fetchFileTemp8739978227963911629.tmp
   ```
   
   So the added file won't exist under root dir with it's filename.
   The code of `Utils.fetchFile()` as below
   <img width="1110" alt="截屏2023-11-22 上午10 21 58" src="https://github.com/apache/spark/assets/46485123/68f6e2f9-a6e2-493d-bd65-d7b2cc88fadd">
   
   
   It's clear that executor is local should pass `useCache=false` since in local mode, it should use file fetched by sc.
   But current code, sc won't add this file with it's file name.
   
   So I think should be like
   
   1. SC add file should also copy file to root dir with the file name, then driver side also can get the file with file name then can run local task in driver
   2. For non-local mode executor will also update the dependencies and work well
   3. For local mode executor, it was started in driver process. It can use the file downloaded by `SC.addFile()`
   
   



##########
core/src/main/scala/org/apache/spark/SparkContext.scala:
##########
@@ -1822,7 +1822,7 @@ class SparkContext(config: SparkConf) extends Logging {
       logInfo(s"Added file $path at $key with timestamp $timestamp")
       // Fetch the file locally so that closures which are run on the driver can still use the
       // SparkFiles API to access files.
-      Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, timestamp, useCache = false)
+      Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, timestamp, useCache = true)

Review Comment:
   Executor log when `updateDependencies`
   ```
   23/11/21 17:44:55 INFO Utils: Fetching hdfs://path/feature_map.txt to /mnt/ssd/2/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/spark-e5d383fd-0064-44e8-850b-c2c1934a0ddf/fetchFileTemp5380393885914736245.tmp
   23/11/21 17:44:55 INFO Utils: Copying /mnt/ssd/2/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/spark-e5d383fd-0064-44e8-850b-c2c1934a0ddf/-17061381181700559593903_cache to /mnt/ssd/1/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/container_e59_1698132018785_8173703_01_000683/./feature_map.txt
   ```
   
   In executor side, pass `useCache = true` when is not local mode, then executor will fetch the file to cache then copy cache file to root dir with filename.
   
   For sparkcontext dirver, current code pass `useCache=false` only fetch file as  file temp
   ```
   23/11/21 17:39:53 INFO [pool-3-thread-2] SparkContext: Added file hdfs://path/feature_map.txt at hdfs://path/feature_map.txt with timestamp 1700559593903
   23/11/21 17:39:54 INFO [pool-3-thread-2] Utils: Fetching hdfs://path/feature_map.txt to /mnt/ssd/0/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/spark-21bedef6-1c5e-464e-9cb0-bb6903b3d84c/userFiles-a4929fdb-b634-4829-a7e3-00d82b0d521b/fetchFileTemp8739978227963911629.tmp
   ```
   
   So the added file won't exist under root dir with it's filename.
   The code of `Utils.fetchFile()` as below
   <img width="1110" alt="截屏2023-11-22 上午10 21 58" src="https://github.com/apache/spark/assets/46485123/68f6e2f9-a6e2-493d-bd65-d7b2cc88fadd">
   
   
   It's clear that executor is local should pass `useCache=false` since in local mode, it should use file fetched by sc.
   But current code, sc won't add this file with it's file name.
   
   So I think should be like
   
   1. SC add file should also copy file to root dir with the file name, then driver side also can get the file with file name then can run local task in driver
   2. For non-local mode executor will also update the dependencies and work well
   3. For local mode executor, it was started in driver process. It can use the file downloaded by `SC.addFile()`
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46034][BUG] SparkContext add file should also copy file to local root path [spark]

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1820842254

   test


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on code in PR #43936:
URL: https://github.com/apache/spark/pull/43936#discussion_r1401447578


##########
core/src/main/scala/org/apache/spark/SparkContext.scala:
##########
@@ -1822,7 +1822,7 @@ class SparkContext(config: SparkConf) extends Logging {
       logInfo(s"Added file $path at $key with timestamp $timestamp")
       // Fetch the file locally so that closures which are run on the driver can still use the
       // SparkFiles API to access files.
-      Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, timestamp, useCache = false)
+      Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, timestamp, useCache = true)

Review Comment:
   Here is the log



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1825190497

   > Can we fix `SparkFiles.get` at driver side when Yarn cluster is used?
   
   `SparkFiles.get()` don't have problem. 
   The problem is user use relative path to use the added file.
   But `SparkContext.addFile` add to driverTempDir
   ```
     def getRootDirectory(): String =
       SparkEnv.get.driverTmpDir.getOrElse(".")
   ```
   
   But executing use `.` path to get file.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #43936:
URL: https://github.com/apache/spark/pull/43936#discussion_r1401465887


##########
core/src/main/scala/org/apache/spark/SparkContext.scala:
##########
@@ -1822,7 +1822,7 @@ class SparkContext(config: SparkConf) extends Logging {
       logInfo(s"Added file $path at $key with timestamp $timestamp")
       // Fetch the file locally so that closures which are run on the driver can still use the
       // SparkFiles API to access files.
-      Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, timestamp, useCache = false)
+      Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, timestamp, useCache = true)

Review Comment:
   Can you point out which code we assign the temp name? `doFetchFile(url, targetDir, fileName, conf, hadoopConf)` should preserve the file name



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1822369346

   > cc @mridulm or @tgravescs have you ever seen such things before?: `SparkContext.addFiles` adds a file with a temporary name, and you cannot get it with the original name from `SparkFiles.get(...)`
   
   I have checked, this is a mistake,  since it use move file logic, so there is no log.
   The true reason is as  https://github.com/apache/spark/pull/43936#issuecomment-1822339813


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on code in PR #43936:
URL: https://github.com/apache/spark/pull/43936#discussion_r1401427353


##########
core/src/main/scala/org/apache/spark/SparkContext.scala:
##########
@@ -1822,7 +1822,7 @@ class SparkContext(config: SparkConf) extends Logging {
       logInfo(s"Added file $path at $key with timestamp $timestamp")
       // Fetch the file locally so that closures which are run on the driver can still use the
       // SparkFiles API to access files.
-      Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, timestamp, useCache = false)
+      Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, timestamp, useCache = true)

Review Comment:
   Executor log when `updateDependencies`
   ```
   23/11/21 17:44:55 INFO Utils: Fetching hdfs://path/feature_map.txt to /mnt/ssd/2/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/spark-e5d383fd-0064-44e8-850b-c2c1934a0ddf/fetchFileTemp5380393885914736245.tmp
   23/11/21 17:44:55 INFO Utils: Copying /mnt/ssd/2/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/spark-e5d383fd-0064-44e8-850b-c2c1934a0ddf/-17061381181700559593903_cache to /mnt/ssd/1/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/container_e59_1698132018785_8173703_01_000683/./feature_map.txt
   ```
   
   In executor side, pass `useCache = true` when is not local mode, then executor will fetch the file to cache then copy cache file to root dir with filename.
   
   For sparkcontext dirver, current code pass `useCache=false` only fetch file as  file temp
   ```
   23/11/21 17:39:53 INFO [pool-3-thread-2] SparkContext: Added file hdfs://path/feature_map.txt at hdfs://path/feature_map.txt with timestamp 1700559593903
   23/11/21 17:39:54 INFO [pool-3-thread-2] Utils: Fetching hdfs://path/feature_map.txt to /mnt/ssd/0/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/spark-21bedef6-1c5e-464e-9cb0-bb6903b3d84c/userFiles-a4929fdb-b634-4829-a7e3-00d82b0d521b/fetchFileTemp8739978227963911629.tmp
   ```
   
   So the added file won't exist under root dir with it's filename.
   The code of `Utils.fetchFile()` as below
   <img width="1110" alt="截屏2023-11-22 上午10 21 58" src="https://github.com/apache/spark/assets/46485123/68f6e2f9-a6e2-493d-bd65-d7b2cc88fadd">
   
   
   It's clear that executor is local should pass `useCache=false` since in local mode, it should use file fetched by sc.
   But current code, sc won't add this file with it's file name.
   
   So I think should be like
   
   1. SC add file should also copy file to root dir with the file name, then driver side also can get the file with file name then can run local task in driver
   2. For non-local mode executor will also update the dependencies and work well
   3. For local mode executor, it was started in driver process. It can use the file downloaded by `SC.addFile()`
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46034][BUG] SparkContext add file should also copy file to local root path [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #43936:
URL: https://github.com/apache/spark/pull/43936#discussion_r1400757260


##########
core/src/main/scala/org/apache/spark/SparkContext.scala:
##########
@@ -1822,7 +1822,7 @@ class SparkContext(config: SparkConf) extends Logging {
       logInfo(s"Added file $path at $key with timestamp $timestamp")
       // Fetch the file locally so that closures which are run on the driver can still use the
       // SparkFiles API to access files.
-      Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, timestamp, useCache = false)
+      Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, timestamp, useCache = true)

Review Comment:
   why changing the `useCache` flag fixes the problem?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1825009537

   Can we fix `SparkFiles.get` at driver side when Yarn cluster is used?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1822356253

   cc @mridulm have you ever seen such things before?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1984826472

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1822401850

   The root cause is spark driver download file to it's `driverTempPath`, but didn't download to container's execution root path. 
   So in yarn cluster mode, if we need to use the file in driver side, it throw FileNotFoundException. cc @HyukjinKwon @cloud-fan I have updated the code.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1824131573

   Any more suggestion? cc @HyukjinKwon @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

Posted by "tgravescs (via GitHub)" <gi...@apache.org>.
tgravescs commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1828063684

   I have not used addFiles on yarn in a long time so can't speak to whether it got broken.   Generally speaking it's not recommended and user should pass files on submission.  Whatever happens here we need to make sure it works in both cluster and client modes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1829030542

   @AngersZhuuuu can you fix the PR description, and explain how this issue happens specifically in Yarn cluster mode? I still can't fully follow what where is the issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46034][BUG] SparkContext add file should also copy file to local root path [spark]

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1820839606

   ping @cloud-fan @HyukjinKwon 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #43936:
URL: https://github.com/apache/spark/pull/43936#discussion_r1401445511


##########
core/src/main/scala/org/apache/spark/SparkContext.scala:
##########
@@ -1822,7 +1822,7 @@ class SparkContext(config: SparkConf) extends Logging {
       logInfo(s"Added file $path at $key with timestamp $timestamp")
       // Fetch the file locally so that closures which are run on the driver can still use the
       // SparkFiles API to access files.
-      Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, timestamp, useCache = false)
+      Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, timestamp, useCache = true)

Review Comment:
   So are you saying that it doesn't fetch the file to `root` directory?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1822339813

   Should be I make a mistake.
   
   In driver side, file was download and copy to path driverTempPath
   ```
   /mnt/ssd/0/yarn/nm-local-dir/usercache/typhoon.bo/appcache/application_1698132018785_8173703/spark-21bedef6-1c5e-464e-9cb0-bb6903b3d84c/userFiles-a4929fdb-b634-4829-a7e3-00d82b0d521b/fetchFileTemp8739978227963911629.tmp
   ```
   
   but when execute the udf, it use container's path
   ```
   /mnt/ssd/2/yarn/nm-local-dir/usercache/typhoon.bo/appcache/application_1698132018785_8173703/container_e59_1698132018785_8173703_01_000002/
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

Posted by "yaooqinn (via GitHub)" <gi...@apache.org>.
yaooqinn commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1827051972

   Could you please update the PR description? It looks overdue and inaccurate for me to catch up with.
   
   Could you also provide the output for `LIST FILE`? I guess we shall use its output at the caller-side, let's verify that they match.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org