Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/12/08 14:45:59 UTC

[GitHub] [spark] tomvanbussel commented on a diff in pull request #38979: [SPARK-41446][CONNECT][PYTHON] Make `createDataFrame` support schema and more input dataset types

tomvanbussel commented on code in PR #38979:
URL: https://github.com/apache/spark/pull/38979#discussion_r1043433892


##########
python/pyspark/sql/connect/session.py:
##########
@@ -264,9 +294,67 @@ def createDataFrame(self, data: "pd.DataFrame") -> "DataFrame":
 
         """
         assert data is not None
-        if len(data) == 0:
+        if isinstance(data, DataFrame):
+            raise TypeError("data is already a DataFrame")
+        if isinstance(data, Sized) and len(data) == 0:
             raise ValueError("Input data cannot be empty")
-        return DataFrame.withPlan(plan.LocalRelation(data), self)
+
+        struct: Optional[StructType] = None
+        column_names: List[str] = []
+
+        if isinstance(schema, StructType):
+            struct = schema
+            column_names = struct.names
+
+        elif isinstance(schema, str):
+            struct = _parse_datatype_string(schema)  # type: ignore[assignment]

Review Comment:
   I'm not sure if this can be used here, as `_parse_datatype_string` internally calls into the JVM. I think we have to add an option to the `LocalRelation` to pass a schema string instead.



##########
python/pyspark/sql/connect/session.py:
##########
@@ -264,9 +294,67 @@ def createDataFrame(self, data: "pd.DataFrame") -> "DataFrame":
 
         """
         assert data is not None
-        if len(data) == 0:
+        if isinstance(data, DataFrame):
+            raise TypeError("data is already a DataFrame")
+        if isinstance(data, Sized) and len(data) == 0:
             raise ValueError("Input data cannot be empty")
-        return DataFrame.withPlan(plan.LocalRelation(data), self)
+
+        struct: Optional[StructType] = None
+        column_names: List[str] = []
+
+        if isinstance(schema, StructType):
+            struct = schema
+            column_names = struct.names

Review Comment:
   This will ignore the names of nested fields, and it will ignore the types. To me it seems that we should leave the Pandas DataFrame untouched here, and instead pass the schema struct in the `LocalRelation` message to the driver.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org