You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Conor Begley <co...@queryclick.com> on 2019/06/25 10:22:20 UTC

Potential Problem : Dropping malformed tables from CSV (PySpark)

Hey,

Currently working with CSV data and I've come across an unusual case. In
Databricks, if I use the following code:

GA_pages =
sqlContext.read.format('com.databricks.spark.csv').options(header='true',
inferschema='true', comment = '#', mode =
"DROPMALFORMED").load(ga_sessions_path)

display(GA_pages)

*OUTPUT:*
+------------------+--------+--------+--------------+---------+-----------+-------------+---------------------+------------+----------+--------------------------+
|User-Defined Value| Date|Sessions|% New Sessions|New Users|Bounce
Rate|Pages/Session|Avg. Session Duration|Transactions| Revenue|E-commerce
Conversion Rate|
+------------------+--------+--------+--------------+---------+-----------+-------------+---------------------+------------+----------+--------------------------+
| (not set)|20190508| 1,987| 60.09%| 1,194| 34.88%| 5.25| 00:03:34| 18|
£2,205.22| 0.91%| | (not set)|20190528| 1,910| 62.20%| 1,188| 36.07%| 4.53|
00:02:47| 9| £1,045.56| 0.47%| | (not set)|20190509| 1,723| 62.45%| 1,076|
34.07%| 5.39| 00:03:49| 13| £2,264.27| 0.75%| | (not set)|20190530| 1,682|
64.98%| 1,093| 31.87%| 5.39| 00:03:39| 23| £3,292.06| 1.37%| | (not
set)|20190527| 1,677| 63.33%| 1,062| 35.24%| 4.52| 00:03:02| 11| £2,261.22|
0.66%| | (not set)|20190511| 1,643| 62.08%| 1,020| 39.81%| 5.16| 00:03:06|
14| £2,412.03| 0.85%| | (not set)|20190529| 1,623| 61.24%| 994| 35.55%|
4.87| 00:03:05| 16| £3,213.60| 0.99%| | (not set)|20190513| 1,622| 68.13%|
1,105| 34.77%| 4.81| 00:03:07| 8| £1,660.61| 0.49%| | (not set)|20190505|
1,526| 59.90%| 914| 35.91%| 6.01| 00:03:48| 15| £3,022.73| 0.98%| | (not
set)|20190504| 1,515| 61.12%| 926| 38.81%| 5.58| 00:03:26| 16| £2,238.47|
1.06%| | (not set)|20190514| 1,513| 59.35%| 898| 34.10%| 4.98| 00:03:12|
14| £3,055.11| 0.93%| | (not set)|20190526| 1,509| 69.05%| 1,042| 32.94%|
5.01| 00:03:17| 9| £2,346.37| 0.60%| | (not set)|20190501| 1,504| 53.46%|
804| 36.90%| 4.91| 00:03:13| 14| £2,606.60| 0.93%| | (not set)|20190506|
1,502| 63.18%| 949| 36.55%| 4.97| 00:03:19| 16| £2,419.02| 1.07%| | (not
set)|20190517| 1,444| 58.86%| 850| 33.73%| 5.15| 00:03:13| 13| £3,650.09|
0.90%| | (not set)|20190521| 1,428| 58.40%| 834| 38.24%| 4.99| 00:03:37|
12| £1,718.84| 0.84%| | (not set)|20190502| 1,398| 58.44%| 817| 37.55%|
5.42| 00:03:40| 14| £2,835.89| 1.00%| | (not set)|20190503| 1,385| 56.03%|
776| 37.83%| 4.98| 00:03:06| 4| £763.65| 0.29%| | (not set)|20190507|
1,366| 59.66%| 815| 38.80%| 5.36| 00:03:30| 16| £2,094.68| 1.17%| | (not
set)|20190510| 1,357| 63.23%| 858| 34.49%| 5.3| 00:03:31| 10| £1,432.15|
0.74%| | (not set)|20190601| 1,355| 56.75%| 769| 36.38%| 5.31| 00:03:18|
20| £5,260.78| 1.48%| | (not set)|20190531| 1,332| 60.66%| 808| 36.86%|
5.09| 00:03:26| 13| £2,006.91| 0.98%| | (not set)|20190602| 1,332| 60.96%|
812| 37.16%| 5.08| 00:03:29| 17| £3,419.28| 1.28%| | (not set)|20190603|
1,329| 57.49%| 764| 36.19%| 5.11| 00:03:22| 11| £2,882.18| 0.83%| | (not
set)|20190515| 1,299| 59.12%| 768| 35.95%| 5.58| 00:03:57| 3| £118.00|
0.23%| | (not set)|20190604| 1,291| 60.42%| 780| 36.48%| 4.78| 00:03:02| 9|
£1,826.35| 0.70%| | (not set)|20190520| 1,274| 61.54%| 784| 35.32%| 5.44|
00:03:33| 14| £2,845.98| 1.10%| | (not set)|20190522| 1,256| 63.54%| 798|
37.18%| 4.71| 00:03:06| 13| £1,551.32| 1.04%| | (not set)|20190605| 1,254|
58.21%| 730| 36.04%| 4.85| 00:03:14| 10| £1,676.51| 0.80%| | (not
set)|20190512| 1,244| 64.63%| 804| 36.82%| 5.44| 00:03:15| 12| £1,823.22|
0.96%| | (not set)|20190518| 1,201| 62.03%| 745| 36.47%| 5.46| 00:03:32|
12| £1,765.69| 1.00%| | (not set)|20190516| 1,177| 60.92%| 717| 33.98%|
5.39| 00:03:51| 16| £2,851.27| 1.36%| | (not set)|20190524| 1,137| 61.39%|
698| 40.55%| 4.76| 00:02:52| 10| £1,562.80| 0.88%| | (not set)|20190525|
1,110| 63.87%| 709| 34.14%| 5.83| 00:03:46| 15| £2,520.85| 1.35%| | (not
set)|20190523| 1,105| 61.27%| 677| 37.56%| 4.66| 00:03:22| 10| £2,472.89|
0.90%| | (not set)|20190519| 1,081| 64.20%| 694| 32.01%| 5.93| 00:03:35|
10| £2,341.90| 0.93%| | null| null| 51,091| 61.21%| 31,272| 35.99%| 5.15|
00:03:22| 460|£83,464.11| 0.90%|
+------------------+--------+--------+--------------+---------+-----------+-------------+---------------------+------------+----------+--------------------------+


The code does as expected and the malformed data is dropped

However if I carry out filters on the data before displaying, the malformed
data reappears

GA_pages =
sqlContext.read.format('com.databricks.spark.csv').options(header='true',
inferschema='true', comment = '#', mode =
"DROPMALFORMED").load(ga_sessions_path)

GA_pages = GA_pages.withColumn('date', fn.col('Date'))\
  .withColumn('sessions', fn.regexp_replace("Sessions", r',', ""))\
  .withColumn('pages/session', fn.col('Pages/Session'))\
  .withColumn('pages', (fn.col('pages/session')*fn.col('sessions')))\
  .select('date', 'sessions', 'pages', 'pages/session')\

display(GA_pages)

*OUTPUT:*
+--------+-------------+------------------+-------------+ | date| sessions|
pages|pages/session|
+--------+-------------+------------------+-------------+ |20190508| 1987|
10431.75| 5.25| |20190528| 1910| 8652.300000000001| 4.53| |20190509| 1723|
9286.97| 5.39| |20190530| 1682| 9065.98| 5.39| |20190527| 1677|
7580.039999999999| 4.52| |20190511| 1643| 8477.880000000001| 5.16|
|20190529| 1623| 7904.01| 4.87| |20190513| 1622| 7801.82| 4.81| |20190505|
1526| 9171.26| 6.01| |20190504| 1515| 8453.7| 5.58| |20190514| 1513|
7534.740000000001| 4.98| |20190526| 1509| 7560.089999999999| 5.01|
|20190501| 1504| 7384.64| 4.91| |20190506| 1502| 7464.94| 4.97| |20190517|
1444| 7436.6| 5.15| |20190521| 1428| 7125.72| 4.99| |20190502| 1398|
7577.16| 5.42| |20190503| 1385| 6897.3| 4.98| |20190507| 1366| 7321.76|
5.36| |20190510| 1357| 7192.099999999999| 5.3| |20190601| 1355|
7195.049999999999| 5.31| |20190531| 1332| 6779.88| 5.09| |20190602| 1332|
6766.56| 5.08| |20190603| 1329|6791.1900000000005| 5.11| |20190515| 1299|
7248.42| 5.58| |20190604| 1291|6170.9800000000005| 4.78| |20190520| 1274|
6930.56| 5.44| |20190522| 1256| 5915.76| 4.71| |20190605| 1254| 6081.9|
4.85| |20190512| 1244| 6767.360000000001| 5.44| |20190518| 1201| 6557.46|
5.46| |20190516| 1177| 6344.03| 5.39| |20190524| 1137| 5412.12| 4.76|
|20190525| 1110| 6471.3| 5.83| |20190523| 1105| 5149.3| 4.66| |20190519|
1081| 6410.33| 5.93| | null| 51091| 263118.65| 5.15|
|Sessions|Pages/Session| null| null| | 1,504| 4.91| null| null| | 1,398|
5.42| null| null| | 1,385| 4.98| null| null| | 1,515| 5.58| null| null| |
1,526| 6.01| null| null| | 1,502| 4.97| null| null| | 1,366| 5.36| null|
null| | 1,987| 5.25| null| null| | 1,723| 5.39| null| null| | 1,357| 5.30|
null| null| | 1,643| 5.16| null| null| | 1,244| 5.44| null| null| | 1,622|
4.81| null| null| | 1,513| 4.98| null| null| | 1,299| 5.58| null| null| |
1,177| 5.39| null| null| | 1,444| 5.15| null| null| | 1,201| 5.46| null|
null| | 1,081| 5.93| null| null| | 1,274| 5.44| null| null| | 1,428| 4.99|
null| null| | 1,256| 4.71| null| null| | 1,105| 4.66| null| null| | 1,137|
4.76| null| null| | 1,110| 5.83| null| null| | 1,509| 5.01| null| null| |
1,677| 4.52| null| null| | 1,910| 4.53| null| null| | 1,623| 4.87| null|
null| | 1,682| 5.39| null| null| | 1,332| 5.09| null| null| | 1,355| 5.31|
null| null| | 1,332| 5.08| null| null| | 1,329| 5.11| null| null| | 1,291|
4.78| null| null| | 1,254| 4.85| null| null| | 51,091| 5.15| null| null|
+--------+-------------+------------------+-------------+
But then if I cache the data frame before filtering, the malformed data is
ignored:

GA_pages =
sqlContext.read.format('com.databricks.spark.csv').options(header='true',
inferschema='true', comment = '#', mode =
"DROPMALFORMED").load(ga_sessions_path).cache()

GA_pages = GA_pages.withColumn('date', fn.col('Date'))\
  .withColumn('sessions', fn.regexp_replace("Sessions", r',', ""))\
  .withColumn('pages/session', fn.col('Pages/Session'))\
  .withColumn('pages', (fn.col('pages/session')*fn.col('sessions')))\
  .select('date', 'sessions', 'pages', 'pages/session')\

display(GA_pages)

*Output:*
+--------+--------+------------------+-------------+ | date|sessions|
pages|pages/session| +--------+--------+------------------+-------------+
|20190508| 1987| 10431.75| 5.25| |20190528| 1910| 8652.300000000001| 4.53|
|20190509| 1723| 9286.97| 5.39| |20190530| 1682| 9065.98| 5.39| |20190527|
1677| 7580.039999999999| 4.52| |20190511| 1643| 8477.880000000001| 5.16|
|20190529| 1623| 7904.01| 4.87| |20190513| 1622| 7801.82| 4.81| |20190505|
1526| 9171.26| 6.01| |20190504| 1515| 8453.7| 5.58| |20190514| 1513|
7534.740000000001| 4.98| |20190526| 1509| 7560.089999999999| 5.01|
|20190501| 1504| 7384.64| 4.91| |20190506| 1502| 7464.94| 4.97| |20190517|
1444| 7436.6| 5.15| |20190521| 1428| 7125.72| 4.99| |20190502| 1398|
7577.16| 5.42| |20190503| 1385| 6897.3| 4.98| |20190507| 1366| 7321.76|
5.36| |20190510| 1357| 7192.099999999999| 5.3| |20190601| 1355|
7195.049999999999| 5.31| |20190531| 1332| 6779.88| 5.09| |20190602| 1332|
6766.56| 5.08| |20190603| 1329|6791.1900000000005| 5.11| |20190515| 1299|
7248.42| 5.58| |20190604| 1291|6170.9800000000005| 4.78| |20190520| 1274|
6930.56| 5.44| |20190522| 1256| 5915.76| 4.71| |20190605| 1254| 6081.9|
4.85| |20190512| 1244| 6767.360000000001| 5.44| |20190518| 1201| 6557.46|
5.46| |20190516| 1177| 6344.03| 5.39| |20190524| 1137| 5412.12| 4.76|
|20190525| 1110| 6471.3| 5.83| |20190523| 1105| 5149.3| 4.66| |20190519|
1081| 6410.33| 5.93| | null| 51091| 263118.65| 5.15|
+--------+--------+------------------+-------------+

I assume this isn't the intended behavior for mode = 'DROPMALFORMED'. It
appears that with Databricks lazy execution, somehow the 'mode =
'DROPMALFORMED' is ignored

I've included the CSV file I have been using

*(It's working noting that, in order for this error to be repeatable, if
.cache() has been used, the Databricks cluster must be restarted. For some
reason .cache() doesn't appear to be cleared by clearing state and results
)*

Regards,
-- 

*Conor Begley*   |   *QUERY*CLICK
*Software Developer Intern*

e: conor.begley@queryclick.com <%2...@queryclick.com>
dd: +44 (0)131 516 5251
ed: +44 (0)131 556 7078 // lon: +44 (0)207 183 0344

[image: QueryClick is a Healthy Working Lives Bronze Award Winner.]

Feed your online marketing muse & join us on Google+
<https://plus.google.com/+Queryclick/posts>, Twitter
<https://www.twitter.com/queryclick>, LinkedIn
<http://www.linkedin.com/company/queryclick> and Facebook
<https://www.facebook.com/queryclick.search.marketing>.

QueryClick Ltd (c/o reg: SC342868 & VAT GB 935518900) is registered in
Scotland with offices at The Stamp Office, 10 Waterloo Place, Edinburgh,
EH1 3EG & 1 Mark Square, London, EC2A 4EG.

Advice or opinion voiced in this correspondence are those of the author and
may not represent those of QueryClick Ltd. Information in this email and
any attachments are confidential and intended for the named recipients only.

Performance, improved. http://uk.queryclick.com/