You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2021/10/05 01:01:17 UTC
[jira] [Comment Edited] (SPARK-26786) Handle to treat escaped newline characters('\r','\n') in spark csv

    [ https://issues.apache.org/jira/browse/SPARK-26786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424241#comment-17424241 ] 

Hyukjin Kwon edited comment on SPARK-26786 at 10/5/21, 1:00 AM:
----------------------------------------------------------------

It became a known limitation in univcity parser.


was (Author: hyukjin.kwon):
It became a known limitation in universe.

> Handle to treat escaped newline characters('\r','\n') in spark csv
> ------------------------------------------------------------------
>
>                 Key: SPARK-26786
>                 URL: https://issues.apache.org/jira/browse/SPARK-26786
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output, PySpark, SQL
>    Affects Versions: 2.3.0
>            Reporter: vishnuram selvaraj
>            Priority: Major
>              Labels: bulk-closed
>
> There are some systems like AWS redshift which writes csv files by escaping newline characters('\r','\n') in addition to escaping the quote characters, if they come as part of the data.
> Redshift documentation link([https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html)] and below is their mention of escaping requirements in the mentioned link
> ESCAPE
> For CHAR and VARCHAR columns in delimited unload files, an escape character (\{{}}) is placed before every occurrence of the following characters:
>  * Linefeed: {{\n}}
>  * Carriage return: {{\r}}
>  * The delimiter character specified for the unloaded data.
>  * The escape character: \{{}}
>  * A quote character: {{"}} or {{'}} (if both ESCAPE and ADDQUOTES are specified in the UNLOAD command).
>  
> *Problem statement:* 
> But the spark CSV reader doesn't have a handle to treat/remove the escape characters infront of the newline characters in the data.
> It would really help if we can add a feature to handle the escaped newline characters through another parameter like (escapeNewline = 'true/false').
> *Example:*
> Below are the details of my test data set up in a file.
>  * The first record in that file has escaped windows newline character (
>  r
>  n)
>  * The third record in that file has escaped unix newline character (
>  n)
>  * The fifth record in that file has the escaped quote character (")
> the file looks like below in vi editor:
>  
> {code:java}
> "1","this is \^M\
> line1"^M
> "2","this is line2"^M
> "3","this is \
> line3"^M
> "4","this is \" line4"^M
> "5","this is line5"^M{code}
>  
> When I read the file in python's csv module with escape, it is able to remove the added escape characters as you can see below,
>  
> {code:java}
> >>> with open('/tmp/test3.csv','r') as readCsv:
> ... readFile = csv.reader(readCsv,dialect='excel',escapechar='\\',quotechar='"',delimiter=',',doublequote=False)
> ... for row in readFile:
> ... print(row)
> ...
> ['1', 'this is \r\n line1']
> ['2', 'this is line2']
> ['3', 'this is \n line3']
> ['4', 'this is " line4']
> ['5', 'this is line5']
> {code}
> But if I read the same file in spark-csv reader, the escape characters infront of the newline characters are not removed.But the escape before the (") is removed.
> {code:java}
> >>> redDf=spark.read.csv(path='file:///tmp/test3.csv',header='false',sep=',',quote='"',escape='\\',multiLine='true',ignoreLeadingWhiteSpace='true',ignoreTrailingWhiteSpace='true',mode='FAILFAST',inferSchema='false')
> >>> redDf.show()
> +---+------------------+
> |_c0| _c1|
> +---+------------------+
> \ 1|this is \
> line1|
> | 2| this is line2|
> | 3| this is \
> line3|
> | 4| this is " line4|
> | 5| this is line5|
> +---+------------------+
> {code}
>  *Expected result:*
> {code:java}
> +---+------------------+
> |_c0| _c1|
> +---+------------------+
> | 1|this is 
> line1|
> | 2| this is line2|
> | 3| this is 
> line3|
> | 4| this is " line4|
> | 5| this is line5|
> +---+------------------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org