Posted to user@spark.apache.org by Femi Anthony <fe...@gmail.com> on 2016/11/05 21:58:46 UTC
Reading csv files with quoted fields containing embedded commas
Hi, I am trying to process a very large comma-delimited CSV file and I am
running into problems.
The main problem is that some fields contain quoted strings with embedded
commas.
It seems as if PySpark is unable to parse lines containing such fields
correctly, the way that, say, Pandas does.
Here is the code I am using to read the file in PySpark:
df_raw=spark.read.option("header","true").csv(csv_path)
Here is an example of a good and 'bad' line in such a file:
col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17,col18,col19
80015360210876000,11.22,X,4076710258,,,sxsw,,"32 YIU ""A""",S5,,"32 XIY ""W"" JK, RE LK",SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE,23.0,cyclingstats,2012-25-19,432,2023-05-17,CODERED
61670000229561918,137.12,U,8234971771,,,woodstock,,,T4,,,OUTKAST#THROOTS~WUTANG#RUNDMC,0.0,runstats,2013-21-22,1333,2019-11-23,CODEBLUE
Line 0 is the header
Line 1 is the 'problematic' line
Line 2 is a good line.
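As a quick sanity check: the 'problematic' line is well-formed CSV under the
usual convention that a doubled quote ("") inside a quoted field stands for a
literal quote character. A minimal sketch with Python's standard csv module,
assuming the sample above is saved as malformed_data.csv:

import csv

# The stdlib reader follows the common CSV convention:
# "" inside a quoted field is an escaped literal quote.
with open('malformed_data.csv', newline='') as f:
    for row in csv.reader(f):
        # Every row, including the 'problematic' one, yields 19 fields;
        # row[11] (col12) on the first data row is: 32 XIY "W" JK, RE LK
        print(len(row), repr(row[11]))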
Pandas can handle this easily:
In [1]: import pandas as pd
In [2]: pdf = pd.read_csv('malformed_data.csv')
In [4]: pdf[['col12','col13','col14']]
Out[4]:
                  col12                                              col13  \
0  32 XIY "W" JK, RE LK  SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE
1                   NaN                      OUTKAST#THROOTS~WUTANG#RUNDMC

   col14
0   23.0
1    0.0
while PySpark seems to parse this erroneously:
In [5]: sdf = spark.read.format("org.apache.spark.csv").csv('malformed_data.csv', header=True)

In [6]: sdf.select("col12", "col13", "col14").show()
+------------------+--------------------+--------------------+
|             col12|               col13|               col14|
+------------------+--------------------+--------------------+
|  "32 XIY ""W"" JK|              RE LK"|SOMETHINGLIKEAPHE...|
|              null|OUTKAST#THROOTS~W...|                 0.0|
+------------------+--------------------+--------------------+
Is this a bug, or am I doing something wrong?
I am working with Spark 2.0.
Any help is appreciated.
Thanks,
-- Femi
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre
minds." - Albert Einstein.
Re: Reading csv files with quoted fields containing embedded commas
Posted by Femi Anthony <fe...@gmail.com>.
The quote options seem to be related to escaping quotes, and the dataset
isn't escaping quotes. As I said, quoted strings with embedded commas are
something that Pandas handles easily, and even Excel handles them as well.
Femi
On Sun, Nov 6, 2016 at 6:59 AM, Hyukjin Kwon <gu...@gmail.com> wrote:
> Hi Femi,
>
> Have you maybe tried the quote related options specified in the
> documentation?
>
> http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv
>
> Thanks.
--
http://www.femibyte.com/twiki5/bin/view/Tech/
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre
minds." - Albert Einstein.
Re: Reading csv files with quoted fields containing embedded commas
Posted by Hyukjin Kwon <gu...@gmail.com>.
Hi Femi,
Have you maybe tried the quote-related options specified in the
documentation?
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv
Thanks.
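For completeness, a minimal sketch of those options applied to this file.
This is an assumption based on the documented defaults: in Spark 2.x the CSV
reader's quote character defaults to '"' but its escape character defaults to
'\', so doubled quotes ("") inside a quoted field are not unescaped. Pointing
escape at the quote character should make this file parse:

# Sketch, not a verified fix: set escape to the quote character so that
# "" inside a quoted field is read as a literal quote.
df = (spark.read
      .option("header", "true")
      .option("quote", '"')     # default quote character, shown for clarity
      .option("escape", '"')    # default is '\'; override for doubled quotes
      .csv('malformed_data.csv'))

df.select("col12", "col13", "col14").show()

With these options, col12 on the first data row should come back as the
single field 32 XIY "W" JK, RE LK instead of being split at the embedded
comma.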