You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Femi Anthony <fe...@gmail.com> on 2016/11/05 21:58:46 UTC

Reading csv files with quoted fields containing embedded commas

Hi, I am trying to process a very large comma delimited csv file and I am
running into problems.
The main problem is that some fields contain quoted strings with embedded
commas.
It seems as if PySpark is unable to properly parse lines containing such
fields like say Pandas does.

Here is the code I am using to read the file in Pyspark

df_raw=spark.read.option("header","true").csv(csv_path)

Here is an example of a good and 'bad' line in such a file:



col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17,col18,col19
80015360210876000,11.22,X,4076710258,,,sxsw,,"32 YIU ""A""",S5,,"32 XIY
""W""   JK, RE
LK",SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE,23.0,cyclingstats,2012-25-19,432,2023-05-17,CODERED
61670000229561918,137.12,U,8234971771,,,woodstock,,,T4,,,OUTKAST#THROOTS~WUTANG#RUNDMC,0.0,runstats,2013-21-22,1333,2019-11-23,CODEBLUE

Line 0 is the header
Line 1 is the 'problematic' line
Line 2 is a good line.

Pandas can handle this easily:


[1]: import pandas as pd

In [2]: pdf = pd.read_csv('malformed_data.csv')

In [4]: pdf[['col12','col13','col14']]
Out[4]:
                    col12                                             col13
 \
0  32 XIY "W"   JK, RE LK  SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE
1                     NaN                     OUTKAST#THROOTS~WUTANG#RUNDMC

   col14
0   23.0
1    0.0


while Pyspark seems to parse this erroneously:

[5]:
sdf=spark.read.format("org.apache.spark.csv").csv('malformed_data.csv',header=True)

[6]: sdf.select("col12","col13",'col14').show()
+------------------+--------------------+--------------------+
|             col12|               col13|               col14|
+------------------+--------------------+--------------------+
|"32 XIY ""W""   JK|              RE LK"|SOMETHINGLIKEAPHE...|
|              null|OUTKAST#THROOTS~W...|                 0.0|
+------------------+--------------------+--------------------+

 Is this a bug or am I doing something wrong ?
 I am working with Spark 2.0
 Any help is appreciated

Thanks,
-- Femi

http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre
minds." - Albert Einstein.

Re: Reading csv files with quoted fields containing embedded commas

Posted by Femi Anthony <fe...@gmail.com>.
The quote options seem to be related to escaping quotes and the dataset
isn't escaaping quotes. As I said quoted strings with embedded commas is
something that pandas handles easily, and even Excel does that as well.


Femi

On Sun, Nov 6, 2016 at 6:59 AM, Hyukjin Kwon <gu...@gmail.com> wrote:

> Hi Femi,
>
> Have you maybe tried the quote related options specified in the
> documentation?
>
> http://spark.apache.org/docs/latest/api/python/pyspark.sql.
> html#pyspark.sql.DataFrameReader.csv
>
> Thanks.
>
> 2016-11-06 6:58 GMT+09:00 Femi Anthony <fe...@gmail.com>:
>
>> Hi, I am trying to process a very large comma delimited csv file and I am
>> running into problems.
>> The main problem is that some fields contain quoted strings with embedded
>> commas.
>> It seems as if PySpark is unable to properly parse lines containing such
>> fields like say Pandas does.
>>
>> Here is the code I am using to read the file in Pyspark
>>
>> df_raw=spark.read.option("header","true").csv(csv_path)
>>
>> Here is an example of a good and 'bad' line in such a file:
>>
>>
>>     col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col
>> 12,col13,col14,col15,col16,col17,col18,col19
>> 80015360210876000,11.22,X,4076710258,,,sxsw,,"32 YIU ""A""",S5,,"32 XIY
>> ""W""   JK, RE LK",SOMETHINGLIKEAPHENOMENON#Y
>> OUGOTSOUL~BRINGDANOISE,23.0,cyclingstats,2012-25-19,432,2023
>> -05-17,CODERED
>> 61670000229561918,137.12,U,8234971771,,,woodstock,,,T4,,,OUT
>> KAST#THROOTS~WUTANG#RUNDMC,0.0,runstats,2013-21-22,1333,201
>> 9-11-23,CODEBLUE
>>
>> Line 0 is the header
>> Line 1 is the 'problematic' line
>> Line 2 is a good line.
>>
>> Pandas can handle this easily:
>>
>>
>> [1]: import pandas as pd
>>
>> In [2]: pdf = pd.read_csv('malformed_data.csv')
>>
>> In [4]: pdf[['col12','col13','col14']]
>> Out[4]:
>>                     col12
>> col13  \
>> 0  32 XIY "W"   JK, RE LK  SOMETHINGLIKEAPHENOMENON#YOUG
>> OTSOUL~BRINGDANOISE
>> 1                     NaN
>> OUTKAST#THROOTS~WUTANG#RUNDMC
>>
>>    col14
>> 0   23.0
>> 1    0.0
>>
>>
>> while Pyspark seems to parse this erroneously:
>>
>> [5]: sdf=spark.read.format("org.apache.spark.csv").csv('malformed
>> _data.csv',header=True)
>>
>> [6]: sdf.select("col12","col13",'col14').show()
>> +------------------+--------------------+--------------------+
>> |             col12|               col13|               col14|
>> +------------------+--------------------+--------------------+
>> |"32 XIY ""W""   JK|              RE LK"|SOMETHINGLIKEAPHE...|
>> |              null|OUTKAST#THROOTS~W...|                 0.0|
>> +------------------+--------------------+--------------------+
>>
>>  Is this a bug or am I doing something wrong ?
>>  I am working with Spark 2.0
>>  Any help is appreciated
>>
>> Thanks,
>> -- Femi
>>
>> http://www.nextmatrix.com
>> "Great spirits have always encountered violent opposition from mediocre
>> minds." - Albert Einstein.
>>
>
>


-- 
http://www.femibyte.com/twiki5/bin/view/Tech/
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre
minds." - Albert Einstein.

Re: Reading csv files with quoted fields containing embedded commas

Posted by Hyukjin Kwon <gu...@gmail.com>.
Hi Femi,

Have you maybe tried the quote related options specified in the
documentation?

http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv

Thanks.

2016-11-06 6:58 GMT+09:00 Femi Anthony <fe...@gmail.com>:

> Hi, I am trying to process a very large comma delimited csv file and I am
> running into problems.
> The main problem is that some fields contain quoted strings with embedded
> commas.
> It seems as if PySpark is unable to properly parse lines containing such
> fields like say Pandas does.
>
> Here is the code I am using to read the file in Pyspark
>
> df_raw=spark.read.option("header","true").csv(csv_path)
>
> Here is an example of a good and 'bad' line in such a file:
>
>
>     col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,
> col12,col13,col14,col15,col16,col17,col18,col19
> 80015360210876000,11.22,X,4076710258,,,sxsw,,"32 YIU ""A""",S5,,"32 XIY
> ""W""   JK, RE LK",SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE,23.0,
> cyclingstats,2012-25-19,432,2023-05-17,CODERED
> 61670000229561918,137.12,U,8234971771,,,woodstock,,,T4,,,
> OUTKAST#THROOTS~WUTANG#RUNDMC,0.0,runstats,2013-21-22,1333,
> 2019-11-23,CODEBLUE
>
> Line 0 is the header
> Line 1 is the 'problematic' line
> Line 2 is a good line.
>
> Pandas can handle this easily:
>
>
> [1]: import pandas as pd
>
> In [2]: pdf = pd.read_csv('malformed_data.csv')
>
> In [4]: pdf[['col12','col13','col14']]
> Out[4]:
>                     col12
> col13  \
> 0  32 XIY "W"   JK, RE LK  SOMETHINGLIKEAPHENOMENON#
> YOUGOTSOUL~BRINGDANOISE
> 1                     NaN                     OUTKAST#THROOTS~WUTANG#RUNDMC
>
>    col14
> 0   23.0
> 1    0.0
>
>
> while Pyspark seems to parse this erroneously:
>
> [5]: sdf=spark.read.format("org.apache.spark.csv").csv('
> malformed_data.csv',header=True)
>
> [6]: sdf.select("col12","col13",'col14').show()
> +------------------+--------------------+--------------------+
> |             col12|               col13|               col14|
> +------------------+--------------------+--------------------+
> |"32 XIY ""W""   JK|              RE LK"|SOMETHINGLIKEAPHE...|
> |              null|OUTKAST#THROOTS~W...|                 0.0|
> +------------------+--------------------+--------------------+
>
>  Is this a bug or am I doing something wrong ?
>  I am working with Spark 2.0
>  Any help is appreciated
>
> Thanks,
> -- Femi
>
> http://www.nextmatrix.com
> "Great spirits have always encountered violent opposition from mediocre
> minds." - Albert Einstein.
>