
Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/03/05 03:04:00 UTC

[jira] [Resolved] (SPARK-27028) PySpark read .dat file. Multiline issue

     [ https://issues.apache.org/jira/browse/SPARK-27028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-27028.
----------------------------------
    Resolution: Not A Problem

> PySpark read .dat file. Multiline issue
> ---------------------------------------
>
>                 Key: SPARK-27028
>                 URL: https://issues.apache.org/jira/browse/SPARK-27028
>             Project: Spark
>          Issue Type: Question
>          Components: PySpark
>    Affects Versions: 2.4.0
>         Environment: Pyspark(2.4) in AWS EMR
>            Reporter: alokchowdary
>            Priority: Critical
>
> * I am trying to read a .dat file with the PySpark CSV reader. The file contains newline characters ("\n") as part of the data. Spark does not read such a value as a single column; it treats the text after each newline as a new row. I tried the "multiLine" option while reading, but it still does not work.
>  * {{spark.read.csv(file_path, schema=schema, sep=delimiter, multiLine=True)}}
>  * The data looks something like this. Every line below is treated as a row in the DataFrame.
>  * Here '\x01' is the actual delimiter (',' is used below for ease of reading).
> {{1. name,test,12345,}}
> {{2. x, }}
> {{3. desc }}
> {{4. name2,test2,12345 }}
> {{5. ,y}}
> {{6. ,desc2}}
>  * So PySpark treats x and desc as new rows in the DataFrame, with nulls for the other columns.
> How can such data be read in PySpark?
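
One possible workaround, sketched here as an illustration and not taken from the issue itself: as far as I understand, the CSV reader's multiLine option keeps an embedded newline only when the value is quoted, and the data above is unquoted. A pre-processing pass could instead merge physical lines into logical records by counting delimiters. The function name merge_physical_lines, the three-field layout, and the sample rows are all hypothetical, and the sketch assumes the last field of a record never contains a newline.

```python
def merge_physical_lines(lines, delimiter="\x01", n_fields=3):
    """Join physical lines into logical records of n_fields fields each.

    Assumption (not from the report): the last field never contains a
    newline, so a record is complete once it holds n_fields - 1 delimiters.
    """
    records = []
    buf = []  # physical lines collected for the current logical record
    for line in lines:
        buf.append(line.rstrip("\n"))
        # Re-join with "\n" so the embedded newline stays inside the field.
        joined = "\n".join(buf)
        if joined.count(delimiter) >= n_fields - 1:
            records.append(joined.split(delimiter))
            buf = []
    if buf:
        # Trailing incomplete record: keep it rather than drop data silently.
        records.append("\n".join(buf).split(delimiter))
    return records


# Hypothetical sample: the second field of the first record spans two lines.
sample = ["name\x01te\n", "st\x0112345\n", "name2\x01test2\x01678\n"]
rows = merge_physical_lines(sample)
```

If every column is read as a string, the resulting rows could then be handed to spark.createDataFrame(rows, schema=schema). Whether that fits depends on the file size, since this sketch processes the lines in a single Python process.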



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
