Posted to issues@spark.apache.org by "Yunus Emre Gürses (Jira)" <ji...@apache.org> on 2023/10/12 12:53:00 UTC

[jira] [Created] (SPARK-45519) cleanSource problem on FileStreamSource for Windows env

Yunus Emre Gürses created SPARK-45519:
-----------------------------------------

             Summary: cleanSource problem on FileStreamSource for Windows env
                 Key: SPARK-45519
                 URL: https://issues.apache.org/jira/browse/SPARK-45519
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 3.4.1
            Reporter: Yunus Emre Gürses


We use Spark with Scala in a Windows environment. For a streaming read, I set the *{{cleanSource}}* option to "archive" and the *{{sourceArchiveDir}}* option to "archived", as in the code below.
{code:java}
spark.readStream
  .option("cleanSource", "archive")
  .option("sourceArchiveDir", "archived"){code}
When I compared this with a Linux environment, I realized that the problem is with the paths: when I set *{{cleanSource}}* to "delete", cleanup works on both Linux and Windows, but in "archive" mode it does not work on Windows.

The problem is related to how paths are appended on Windows. There is a method

 
{code:java}
override protected def cleanTask(entry: FileEntry): Unit{code}
in the FileStreamSource.scala file in the org.apache.spark.sql.execution.streaming package. On line 569, the call {{!fileSystem.rename(curPath, newPath)}} is supposed to move the source file to the archive folder. However, when I debugged it, I noticed that the curPath and newPath values were as follows on Windows:

 
{code:java}
curPath: file:/C:/dev/be/data-integration-suite/test-data/streaming-folder/patients/patients-success.csv{code}
{code:java}
newPath: file:/C:/dev/be/data-integration-suite/archived/C:/dev/be/data-integration-suite/test-data/streaming-folder/patients/patients-success.csv{code}
It seems that the absolute path of the CSV file was appended when creating newPath, because *C:/dev/be/data-integration-suite* appears twice in it. This is probably why archiving does not work. Instead, newPath should be: file:/C:/dev/be/data-integration-suite/archived/test-data/streaming-folder/patients/patients-success.csv
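
The two path values above can be reproduced outside Spark with a small sketch. It assumes that newPath is built by concatenating the archive directory with the full URI path of the source file; that is inferred from the debugger output, not confirmed against the Spark source. The object name, both helper functions, and the {{baseDir}} parameter are hypothetical and exist only for this illustration.

```scala
import java.net.URI

object ArchivePathSketch {
  // Joins the archive directory and the full URI path of the source file by
  // plain string concatenation (an assumption about how newPath was built).
  def concatenated(archiveDir: String, sourceUri: String): String =
    archiveDir.stripSuffix("/") + new URI(sourceUri).getPath

  // Hypothetical alternative: append only the part of the source path that
  // lies below baseDir, which yields the path the report expects.
  def relativeTo(archiveDir: String, baseDir: String, sourceUri: String): String = {
    val basePath = new URI(baseDir).getPath.stripSuffix("/") + "/"
    val srcPath = new URI(sourceUri).getPath
    require(srcPath.startsWith(basePath), s"$sourceUri is not under $baseDir")
    archiveDir.stripSuffix("/") + "/" + srcPath.stripPrefix(basePath)
  }

  def main(args: Array[String]): Unit = {
    val base = "file:/C:/dev/be/data-integration-suite"
    val archive = base + "/archived"
    val source = base + "/test-data/streaming-folder/patients/patients-success.csv"
    println(concatenated(archive, source)) // contains "C:" in the middle of the path
    println(relativeTo(archive, base, source))
  }
}
```

On a file URI such as file:/C:/dev/..., {{getPath}} returns /C:/dev/..., so the drive letter ends up in the middle of the concatenated result. The second helper is only one possible way to arrive at the expected path and is not a proposed patch.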


