Posted to commits@spark.apache.org by da...@apache.org on 2015/09/02 22:36:40 UTC

spark git commit: [SPARK-10417] [SQL] Iterating through Column results in infinite loop

Repository: spark
Updated Branches:
  refs/heads/master 2da3a9e98 -> 6cd98c187


[SPARK-10417] [SQL] Iterating through Column results in infinite loop

`pyspark.sql.column.Column` defines `__getitem__`, which makes it iterable as far as Python is concerned. The method exists so that you can access elements of a list or dict column through the DF API; iterability is only a side effect. Because `__getitem__` never raises `IndexError`, Python's fallback iteration protocol never stops, so a `for` loop over a Column runs forever. This is likely to confuse people who are getting familiar with Spark DF and expect iteration to work as it does on, for instance, a Pandas DF. This change adds an `__iter__` method to Column that raises `TypeError`.

Issue reproduction:
```
df = sqlContext.jsonRDD(sc.parallelize(['{"name": "El Magnifico"}']))
for i in df["name"]:
    print(i)  # never terminates before this fix
```
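
The mechanism can be sketched without a Spark installation. The classes below are hypothetical stand-ins, not part of PySpark: `FieldAccessor` mimics a `__getitem__` that accepts any key and never raises `IndexError`, and `GuardedFieldAccessor` adds the same kind of `__iter__` guard as this commit. `itertools.islice` is used only to cut the otherwise endless loop short.

```python
import itertools


class FieldAccessor(object):
    """Hypothetical stand-in for Column: __getitem__ accepts any key."""

    def __init__(self, name):
        self.name = name

    def __getitem__(self, key):
        # Builds a new expression for every key and never raises IndexError,
        # so Python's fallback iteration has no stopping condition.
        return FieldAccessor("%s[%r]" % (self.name, key))


class GuardedFieldAccessor(FieldAccessor):
    """Same accessor with an explicit __iter__ guard, as in this commit."""

    def __iter__(self):
        raise TypeError("Column is not iterable")


# Without __iter__, iter() falls back to calling __getitem__(0), (1), (2), ...
# islice caps the otherwise endless iteration at three items.
first_three = [c.name for c in itertools.islice(FieldAccessor("name"), 3)]
print(first_three)

# With the guard, iteration fails immediately instead of looping forever.
try:
    iter(GuardedFieldAccessor("name"))
    guard_error = None
except TypeError as exc:
    guard_error = str(exc)
print(guard_error)

# Item access is unaffected by the guard.
still_indexable = GuardedFieldAccessor("name")["k"].name
```

Raising from `__iter__` keeps `__getitem__` usable for element access while turning the silent infinite loop into an immediate error.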

Author: 0x0FFF <pr...@gmail.com>

Closes #8574 from 0x0FFF/SPARK-10417.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6cd98c18
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6cd98c18
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6cd98c18

Branch: refs/heads/master
Commit: 6cd98c1878a9c5c6475ed5974643021ab27862a7
Parents: 2da3a9e
Author: 0x0FFF <pr...@gmail.com>
Authored: Wed Sep 2 13:36:36 2015 -0700
Committer: Davies Liu <da...@gmail.com>
Committed: Wed Sep 2 13:36:36 2015 -0700

----------------------------------------------------------------------
 python/pyspark/sql/column.py | 3 +++
 python/pyspark/sql/tests.py  | 9 +++++++++
 2 files changed, 12 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/6cd98c18/python/pyspark/sql/column.py
----------------------------------------------------------------------
diff --git a/python/pyspark/sql/column.py b/python/pyspark/sql/column.py
index 0948f9b..56e75e8 100644
--- a/python/pyspark/sql/column.py
+++ b/python/pyspark/sql/column.py
@@ -226,6 +226,9 @@ class Column(object):
             raise AttributeError(item)
         return self.getField(item)
 
+    def __iter__(self):
+        raise TypeError("Column is not iterable")
+
     # string methods
     rlike = _bin_op("rlike")
     like = _bin_op("like")

http://git-wip-us.apache.org/repos/asf/spark/blob/6cd98c18/python/pyspark/sql/tests.py
----------------------------------------------------------------------
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index fc77863..eb449e8 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -1066,6 +1066,15 @@ class SQLTests(ReusedPySparkTestCase):
         keys = self.df.withColumn("key", self.df.key).select("key").collect()
         self.assertEqual([r.key for r in keys], list(range(100)))
 
+    # regression test for SPARK-10417
+    def test_column_iterator(self):
+
+        def foo():
+            for x in self.df.key:
+                break
+
+        self.assertRaises(TypeError, foo)
+
 
 class HiveContextSQLTests(ReusedPySparkTestCase):
 

