Posted to commits@spark.apache.org by gu...@apache.org on 2023/03/08 00:36:11 UTC

[spark] branch branch-3.4 updated: [SPARK-42022][CONNECT][PYTHON] Fix createDataFrame to autogenerate missing column names

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
     new b3108279ffb [SPARK-42022][CONNECT][PYTHON] Fix createDataFrame to autogenerate missing column names
b3108279ffb is described below

commit b3108279ffb201364d480ab3a104e835001ef0b1
Author: Takuya UESHIN <ue...@databricks.com>
AuthorDate: Wed Mar 8 09:35:47 2023 +0900

    [SPARK-42022][CONNECT][PYTHON] Fix createDataFrame to autogenerate missing column names
    
    ### What changes were proposed in this pull request?
    
    Fixes `createDataFrame` to autogenerate missing column names.
    
    ### Why are the changes needed?
    
    Currently, when the number of column names specified to `createDataFrame` does not match the actual number of columns, it raises an error:
    
    ```py
    >>> spark.createDataFrame([["a", "b"]], ["col1"])
    Traceback (most recent call last):
    ...
    ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements
    ```
    
    but it should auto-generate the missing column names.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. It will auto-generate the missing column names:
    
    ```py
    >>> spark.createDataFrame([["a", "b"]], ["col1"])
    DataFrame[col1: string, _2: string]
    ```
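    
    The generated names follow the positional convention `_{i + 1}`, where `i` is the 0-based index of the unnamed column. A minimal sketch of the padding rule (the names `names` and `n_data_cols` are illustrative, not from the patch):
    
    ```py
    >>> names = ["col1"]  # user-supplied column names
    >>> n_data_cols = 2   # actual number of columns in the data
    >>> names + [f"_{i + 1}" for i in range(len(names), n_data_cols)]
    ['col1', '_2']
    ```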
    
    ### How was this patch tested?
    
    Enabled the related test.
    
    Closes #40310 from ueshin/issues/SPARK-42022/columns.
    
    Authored-by: Takuya UESHIN <ue...@databricks.com>
    Signed-off-by: Hyukjin Kwon <gu...@apache.org>
    (cherry picked from commit 056ed5dc67a660bf0808af6edc8668715b64d2a1)
    Signed-off-by: Hyukjin Kwon <gu...@apache.org>
---
 python/pyspark/sql/connect/session.py                 | 7 +++++++
 python/pyspark/sql/tests/connect/test_parity_types.py | 5 -----
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/python/pyspark/sql/connect/session.py b/python/pyspark/sql/connect/session.py
index 04e4f9e88d0..dd1c2d3c510 100644
--- a/python/pyspark/sql/connect/session.py
+++ b/python/pyspark/sql/connect/session.py
@@ -235,6 +235,10 @@ class SparkSession:
             # If no schema supplied by user then get the names of columns only
             if schema is None:
                 _cols = [str(x) if not isinstance(x, str) else x for x in data.columns]
+            elif isinstance(schema, (list, tuple)) and cast(int, _num_cols) < len(data.columns):
+                assert isinstance(_cols, list)
+                _cols.extend([f"_{i + 1}" for i in range(cast(int, _num_cols), len(data.columns))])
+                _num_cols = len(_cols)
 
             # Determine arrow types to coerce data when creating batches
             if isinstance(schema, StructType):
@@ -309,6 +313,9 @@ class SparkSession:
 
             _inferred_schema = self._inferSchemaFromList(_data, _cols)
 
+            if _cols is not None and cast(int, _num_cols) < len(_cols):
+                _num_cols = len(_cols)
+
             if _has_nulltype(_inferred_schema):
                 # For cases like createDataFrame([("Alice", None, 80.1)], schema)
                 # we can not infer the schema from the data itself.
diff --git a/python/pyspark/sql/tests/connect/test_parity_types.py b/python/pyspark/sql/tests/connect/test_parity_types.py
index 3d54d488a5d..67d5a17660e 100644
--- a/python/pyspark/sql/tests/connect/test_parity_types.py
+++ b/python/pyspark/sql/tests/connect/test_parity_types.py
@@ -90,11 +90,6 @@ class TypesParityTests(TypesTestsMixin, ReusedConnectTestCase):
     def test_infer_schema(self):
         super().test_infer_schema()
 
-    # TODO(SPARK-42022): createDataFrame should autogenerate missing column names
-    @unittest.skip("Fails in Spark Connect, should enable.")
-    def test_infer_schema_not_enough_names(self):
-        super().test_infer_schema_not_enough_names()
-
     # TODO(SPARK-42020): createDataFrame with UDT
     @unittest.skip("Fails in Spark Connect, should enable.")
     def test_infer_schema_specification(self):
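
For reference, the padding logic the patch adds to `session.py` can be sketched as a small standalone function. This is a minimal sketch of the behavior, not the implementation; the helper name `pad_column_names` is hypothetical:

```py
from typing import List, Optional, Sequence


def pad_column_names(names: Optional[Sequence[str]], data_columns: Sequence[str]) -> List[str]:
    """Sketch mirroring the patched logic: keep the user-supplied names,
    then fill the tail with positional names "_{i + 1}" so the result
    matches the data's column count."""
    if names is None:
        # No schema supplied: fall back to the data's own column names.
        return [str(c) for c in data_columns]
    cols = list(names)
    cols.extend(f"_{i + 1}" for i in range(len(cols), len(data_columns)))
    return cols


print(pad_column_names(["col1"], ["a", "b"]))  # ['col1', '_2']
print(pad_column_names(None, [0, 1]))          # ['0', '1']
```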

