Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2022/11/01 05:11:00 UTC

[jira] [Assigned] (SPARK-40982) When the value of quote or escape exists in the content of csv file, the character in the csv file will be misidentified

     [ https://issues.apache.org/jira/browse/SPARK-40982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40982:
------------------------------------

    Assignee: Apache Spark

> When the value of quote or escape exists in the content of csv file, the character in the csv file will be misidentified
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-40982
>                 URL: https://issues.apache.org/jira/browse/SPARK-40982
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.2.1
>            Reporter: clairezhuang
>            Assignee: Apache Spark
>            Priority: Minor
>
> We found that when the quote or escape character also appears in the content of a CSV file, characters in the file are misidentified.
> When this content is read by an Azure Data Factory copy activity and written to CSV, the content is:
> "test\\" =
> test"
> We read the CSV as follows:
> df = spark.read.csv(
>     path='test.csv',
>     sep=',',
>     header=True,
>     quote='"',
>     escape='\\',  # a single backslash; it must be doubled inside a Python string literal
>     multiLine=True,
>     lineSep='\n',
> )
> This results in the following being written to the CSV: *test\" =* and *test* on the next line, but what we want is *test\\" = test*.
> Now when the above is being read by Spark:
>  1. The first \ is interpreted as escaping the second \ (so the content looks like a single literal \).
>  2. The " now appears to be an unescaped quote character, so we're back in the situation where Spark tries to handle this using STOP_AT_DELIMITER.
> As before, the rest of the CSV after this point is being parsed incorrectly.
> We could change the quote and escape options to avoid this in the scenario above, but the CSV files are very large and may contain any character. The data sources affected by this issue are systems outside of our control, so we have no means of controlling what content or characters will be there. If we change the quote or escape character, it may conflict with the content again and still cause issues in the content that follows.
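
The two parsing steps described in the report can be illustrated with a small pure-Python scanner. This is only a sketch of the general quote/escape rule, not Spark's actual CSV parser; the function name scan_quoted_field is hypothetical, and the sample string is assumed to be the raw field text from the report joined onto one line.

```python
def scan_quoted_field(text, quote='"', escape='\\'):
    """Return (content, rest) for a field that starts with the quote character.

    Illustrative only: a simplified model of quote/escape handling,
    not the parser Spark uses.
    """
    if not text.startswith(quote):
        raise ValueError("field does not start with the quote character")
    content = []
    i = 1  # skip the opening quote
    while i < len(text):
        ch = text[i]
        if ch == escape and i + 1 < len(text):
            # The escape character makes the next character a literal.
            content.append(text[i + 1])
            i += 2
        elif ch == quote:
            # The first unescaped quote closes the field.
            return "".join(content), text[i + 1:]
        else:
            content.append(ch)
            i += 1
    raise ValueError("no closing quote found")


# Raw field text assumed from the report: "test\\" = test"
raw = r'"test\\" = test"'
content, rest = scan_quoted_field(raw)
print(repr(content))
print(repr(rest))
```

Under this model the first backslash escapes the second, so the following quote is treated as unescaped and closes the field after test\. The remainder ( = test") falls outside the quotes, which corresponds to the point where the report says parsing goes wrong.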



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
