Posted to issues@spark.apache.org by "Carlos Barahona (JIRA)" <ji...@apache.org> on 2017/11/22 19:21:00 UTC

[jira] [Comment Edited] (SPARK-22578) CSV with quoted line breaks not correctly parsed

    [ https://issues.apache.org/jira/browse/SPARK-22578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16263192#comment-16263192 ] 

Carlos Barahona edited comment on SPARK-22578 at 11/22/17 7:20 PM:
-------------------------------------------------------------------

I didn't realize that multiLine was disabled by default. However, using the option example you provided, I run into an upstream problem. While I can read the file with my earlier example, using the
{code:java}
spark.read.option("multiLine", true).csv("tmp.csv").first
{code}

example, the job crashes with:

{code:java}
Caused by: java.io.FileNotFoundException: File file:tmp.csv does not exist
{code}

If I leave the option out, I can still read the file and display the first item without any problem.
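
In case it helps, the full call I'm attempting looks roughly like the sketch below. The absolute path is just a placeholder for the real location of the file on my machine; I'm wondering whether the bare "tmp.csv" stops resolving once multiLine is on.

{code:java}
// Sketch only: the absolute path below is a placeholder, not the real file location.
// multiLine tells the CSV reader to keep quoted line breaks inside a single field.
val df = spark.read
  .option("multiLine", true)
  .csv("file:///absolute/path/to/tmp.csv")

df.first()  // expected: the first logical record, with any quoted line break preserved
df.count()  // expected: one row per logical record, not per physical line
{code}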

Apologies if I'm missing something; I'm rather new to Spark.


> CSV with quoted line breaks not correctly parsed
> ------------------------------------------------
>
>                 Key: SPARK-22578
>                 URL: https://issues.apache.org/jira/browse/SPARK-22578
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Carlos Barahona
>
> I believe the behavior addressed in SPARK-19610 still exists. Using Spark 2.2.0, when attempting to read a CSV file containing a quoted newline, the resulting dataset contains two separate rows split along the quoted newline.
> Example text:
> {code:java}
> 4/28/2015 8:01,4/28/2015 8:19,0,100,1078,1,4/28/2015 8:19,email,"Hello
> World", 2,3,4,5
> {code}
> {code:java}
> scala> val csvFile = spark.read.csv("file:///path")
> csvFile: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 7 more fields]
> scala> csvFile.first()
> res2: org.apache.spark.sql.Row = [4/28/2015 8:01,4/28/2015 8:19,0,100,1078,1,4/28/2015 8:19,email,Hello]
> scala> csvFile.count()
> res3: Long = 2
> {code}
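
For comparison, a rough sketch of what I'd expect when multiLine is enabled on the example above, using the same placeholder path as in the description (the values in the comments are expectations, not verified output):

{code:java}
// Sketch only: file:///path is the same placeholder path used in the description.
// With multiLine enabled, the quoted "Hello\nWorld" should stay in a single field.
val csvFile = spark.read.option("multiLine", true).csv("file:///path")

csvFile.count()  // expected: 1 instead of 2
csvFile.first()  // expected: a single Row with "Hello\nWorld" as one column value
{code}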


