Posted to issues@spark.apache.org by "Yunus Emre Gürses (Jira)" <ji...@apache.org> on 2023/10/12 12:53:00 UTC
[jira] [Created] (SPARK-45519) cleanSource problem on FileStreamSource for Windows env
Yunus Emre Gürses created SPARK-45519:
-----------------------------------------
Summary: cleanSource problem on FileStreamSource for Windows env
Key: SPARK-45519
URL: https://issues.apache.org/jira/browse/SPARK-45519
Project: Spark
Issue Type: Bug
Components: Structured Streaming
Affects Versions: 3.4.1
Reporter: Yunus Emre Gürses
We are using Spark with Scala in a Windows environment. When streaming with Spark, I set the *{{cleanSource}}* option to "archive" and the *{{sourceArchiveDir}}* option to "archived", as in the code below.
{code:java}
spark.readStream
.option("cleanSource", "archive")
.option("sourceArchiveDir", "archived"){code}
After trying the same setup in a Linux environment, I realized that the problem is with the paths. With {{cleanSource}} set to "delete", cleanup works on both Linux and Windows; with "archive", it does not work on Windows.
The problem is related to how paths are appended on Windows. There is a method
{code:java}
override protected def cleanTask(entry: FileEntry): Unit{code}
in the FileStreamSource.scala file in the org.apache.spark.sql.execution.streaming package. On line 569, the call {{!fileSystem.rename(curPath, newPath)}} is supposed to move the source file to the archive folder. However, when I debugged it, I noticed that on Windows the curPath and newPath values were as follows:
{code:java}
curPath: file:/C:/dev/be/data-integration-suite/test-data/streaming-folder/patients/patients-success.csv{code}
{code:java}
newPath: file:/C:/dev/be/data-integration-suite/archived/C:/dev/be/data-integration-suite/test-data/streaming-folder/patients/patients-success.csv{code}
It seems that the absolute path of the CSV file was appended when creating newPath, because *C:/dev/be/data-integration-suite* appears twice in newPath. This is probably why Spark archiving does not work. Instead, newPath should be: file:/C:/dev/be/data-integration-suite/archived/test-data/streaming-folder/patients/patients-success.csv
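The doubled prefix can be reproduced outside Spark with a minimal sketch. It assumes newPath is built by appending the source file's URI path to the archive directory, which is consistent with the values observed above but is not taken from the Spark source. The object {{ArchivePathSketch}}, both helper functions, and the choice of a common base directory are hypothetical and only illustrate the difference between the observed and the expected value. The sketch uses plain strings and {{java.net.URI}} rather than Hadoop's {{Path}}, so it only approximates the real behavior.

```scala
import java.net.URI

object ArchivePathSketch {
  // Hypothetical stand-in for the observed behavior: the full URI path of the
  // source file (including the "/C:" drive part) is appended to the archive dir.
  def appendSourcePath(archiveBase: String, sourceUri: String): String =
    archiveBase.stripSuffix("/") + new URI(sourceUri).getPath

  // The value this report expects instead, derived here by relativizing the
  // source file against an assumed common base directory. Not a proposed patch.
  def appendRelativePath(archiveBase: String, commonBase: String, sourceUri: String): String = {
    val relative = new URI(commonBase).relativize(new URI(sourceUri)).getPath
    archiveBase.stripSuffix("/") + "/" + relative
  }

  def main(args: Array[String]): Unit = {
    val base = "file:/C:/dev/be/data-integration-suite/"
    val archive = base + "archived"
    val source = base + "test-data/streaming-folder/patients/patients-success.csv"

    // Observed shape: ".../archived/C:/dev/be/data-integration-suite/test-data/..."
    println(appendSourcePath(archive, source))
    // Expected shape: ".../archived/test-data/..."
    println(appendRelativePath(archive, base, source))
  }
}
```

In the first result a drive letter and colon end up in the middle of the path. Windows does not allow a colon in a file or directory name, which may be why the rename fails only there.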