Posted to issues@spark.apache.org by "Steven Landes (JIRA)" <ji...@apache.org> on 2017/06/07 20:11:18 UTC
[jira] [Created] (SPARK-21011) RDD filter can combine/corrupt columns
Steven Landes created SPARK-21011:
-------------------------------------
Summary: RDD filter can combine/corrupt columns
Key: SPARK-21011
URL: https://issues.apache.org/jira/browse/SPARK-21011
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.1.0
Reporter: Steven Landes
I used PySpark to read in some CSV files (actually backspace-separated, which might be relevant). The resulting dataframe's show() gives me good data - all my columns are there, everything's great.
df = spark.read.option('delimiter', '\b').csv('<some S3 location>')
df.show() # all is good here
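For reference, here is a minimal Spark-free sketch (using Python's stdlib csv module; the record is invented to mimic the anonymized row below) of how a backspace-delimited line with quoted fields should split into separate columns:

```python
import csv
import io

# Hypothetical backspace-delimited record with quoted fields,
# modeled on the anonymized row shown in this report.
raw = '3\x08"Text Field"\x08"12345"\x08150.00'

# csv.reader honors the double-quote quoting, so the '\x08'
# delimiter splits the line cleanly into four separate columns.
reader = csv.reader(io.StringIO(raw), delimiter='\x08')
row = next(reader)
print(row)  # ['3', 'Text Field', '12345', '150.00']
```

This is what the DataFrame reader appears to produce correctly; the corruption only shows up after the RDD conversion.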
Now, I want to filter this bad boy... but I want to use the RDD's filter because it's just nicer to use.
my_rdd = df.rdd
my_rdd.take(5) # all my columns are still here
filtered_rdd = my_rdd.filter(<some filter criteria here>)
filtered_rdd.take(5)
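As a sanity check on what filter is supposed to do (plain Python, no Spark; the rows here are invented), a filter should only select elements by a predicate, never rewrite their contents:

```python
# filter() selects elements by a predicate; every surviving element
# must be identical to the original, so column values should never
# change merely because a record passed through a filter.
rows = [
    {'_c0': '3', '_c1': 'Text Field', '_c2': '12345'},
    {'_c0': '7', '_c1': 'Other', '_c2': '99999'},
]
kept = list(filter(lambda r: r['_c0'] == '3', rows))
print(kept)  # [{'_c0': '3', '_c1': 'Text Field', '_c2': '12345'}]
```

The surviving record is byte-for-byte the original, which is what one would expect from RDD.filter as well.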
My filtered_rdd is missing a column. Specifically, _c2 has been mashed into _c1.
Here's a relevant record (anonymized) from the df.show():
|3 |Text Field |12345|<some alphanumeric ID mess here>|150.00|UserName|2012-08-14 00:50:00|2015-02-24 01:23:45|2017-02-34 13:02:33|true|false|
...and here's the same record as returned by filtered_rdd.take():
Row(_c0=u'3', _c1=u'"Text Field"\x08"12345"', _c2=u'|<some alphanumeric ID mess here>', _c3=u'150.00', _c4=u'UserName', _c5=u'2012-08-14 00:50:00', _c6=u'2015-02-24 01:23:45', _c7=u'2017-02-34 13:02:33', _c8=u'true', _c9=u'false', _c10=None)
Look at _c1 there - it's been mishmashed together with what was formerly _c2... and poor old _c10 is left without a value.
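One way to detect this kind of column shift programmatically (a hypothetical, Spark-free helper; the function name and sample data are made up, mirroring the corrupted record above) is to compare non-null field counts before and after the transformation:

```python
def column_shift(before, after):
    """Return indices of records whose non-None field count changed.

    `before` and `after` are parallel lists of tuples, one tuple per
    record; a mismatch suggests fields were merged or dropped.
    """
    def width(rec):
        return sum(1 for f in rec if f is not None)
    return [i for i, (b, a) in enumerate(zip(before, after))
            if width(b) != width(a)]

# Invented data mirroring the corruption above: _c1 and _c2 merged
# into one field, leaving the trailing column None.
before = [('3', 'Text Field', '12345')]
after = [('3', '"Text Field"\x08"12345"', None)]
print(column_shift(before, after))  # [0]
```

Running a check like this over a sample of records would confirm the merge is systematic rather than a one-off bad row.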
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)