Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/12/30 10:57:04 UTC

[GitHub] [spark] zhengruifeng commented on a diff in pull request #39313: [SPARK-41745][SPARK-41789] Make `createDataFrame` support list of Rows

zhengruifeng commented on code in PR #39313:
URL: https://github.com/apache/spark/pull/39313#discussion_r1059340299


##########
python/pyspark/sql/tests/connect/test_connect_basic.py:
##########
@@ -387,6 +388,30 @@ def test_with_local_list(self):
         with self.assertRaises(SparkConnectException):
             self.connect.createDataFrame(data, "col1 int, col2 int, col3 int").show()
 
+    def test_with_local_rows(self):
+        # SPARK-41789: Test creating a DataFrame with a list of Rows
+        data = [
+            Row(course="dotNET", year=2012, earnings=10000),
+            Row(course="Java", year=2012, earnings=20000),
+            Row(course="dotNET", year=2012, earnings=5000),
+            Row(course="dotNET", year=2013, earnings=48000),
+            Row(course="Java", year=2013, earnings=30000),
+            Row(course="Scala", year=2022, earnings=None),
+        ]
+
+        sdf = self.spark.createDataFrame(data)
+        cdf = self.connect.createDataFrame(data)
+
+        self.assertEqual(sdf.schema, cdf.schema)
+        self.assert_eq(sdf.toPandas(), cdf.toPandas())
+
+        # test with rename
+        sdf = self.spark.createDataFrame(data, schema=["a", "b", "c"])
+        cdf = self.connect.createDataFrame(data, schema=["a", "b", "c"])
+
+        self.assertEqual(sdf.schema, cdf.schema)
+        self.assert_eq(sdf.toPandas(), cdf.toPandas())
+

Review Comment:
   Nested Rows are not supported for now: https://issues.apache.org/jira/browse/SPARK-41746
   ```
   Traceback (most recent call last):
     File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/tests/connect/test_connect_basic.py", line 427, in test_with_local_nested_rows
       cdf = self.connect.createDataFrame(data)
     File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/session.py", line 249, in createDataFrame
       table = pa.Table.from_pandas(pdf)
     File "pyarrow/table.pxi", line 3475, in pyarrow.lib.Table.from_pandas
     File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 611, in dataframe_to_arrays
       arrays = [convert_column(c, f)
     File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 611, in <listcomp>
       arrays = [convert_column(c, f)
     File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 598, in convert_column
       raise e
     File "/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 592, in convert_column
       result = pa.array(col, type=type_, from_pandas=True, safe=safe)
     File "pyarrow/array.pxi", line 316, in pyarrow.lib.array
     File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
     File "pyarrow/error.pxi", line 123, in pyarrow.lib.check_status
   pyarrow.lib.ArrowTypeError: ("Expected bytes, got a 'int' object", 'Conversion failed for column 1 with type object')
   ```
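   
   For reference, a minimal sketch of the failing case (the nested-Row data below is illustrative, reusing the `course`/`year`/`earnings` names from the test above; it is not the exact input of `test_with_local_nested_rows`):
   ```python
   from pyspark.sql import Row

   # Illustrative nested data: the inner Row ends up as a plain Python
   # object in the intermediate pandas DataFrame, and pa.Table.from_pandas
   # cannot convert it, raising the ArrowTypeError shown above.
   data = [
       Row(course="dotNET", stats=Row(year=2012, earnings=10000)),
   ]

   # `self.connect` is the Spark Connect session used by the test suite.
   cdf = self.connect.createDataFrame(data)  # raises pyarrow.lib.ArrowTypeError
   ```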



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

